Hi guys
can someone test this string comparation function i made ? I have no idea how fast/slow is it.
In theory it should be twice as fast as strcmp from msvcrt, but i´m unable to benchmark it.
UPDATED VERSION 29/01/2013
Proc FastStringCompare:
Arguments @String1, @String2
Uses esi, ecx, ebx, edx
mov esi D@String1
mov ecx 0
mov eax 0
mov ebx D@String2
mov edx D$esi
cmp edx D$ebx ; just to force the zero flags. if we have a match zero flag is settled. But we need to check if the last byte at string1 is zero.
jz L0> ; The 1st 4 bytes are not zero, jmp over
mov ecx 3
shr edx 8 | jz L1> ; If string is only 1 byte len. ecx = 3
inc ecx | jmp L1> ; else, ecx = 4
L0:
test dh dh | jz L1> ; the last byte of the 1st string is zero. We reached the end of the search. Contunue computing the remaining bytes at "L1"
lea ecx D$ecx+4 ; compute the len of the string1
mov edx D$esi+ecx
cmp D$ebx+ecx edx
jz L0<
jmp L3>> ; strings definitelly dont match
L1:
lea ebx D$ebx+ecx | mov edx D$ebx-3
lea esi D$esi+ecx
cmp D$esi-3 edx; if both words on string1 and string2 don´ match jmp over. Otherwise, continue analysing the remaining bytes
jnz L3>>
; We are almost there. If ecx = 0, means that the string1 have only 2 byte len. So let´t search for the 2nd byte of string2
;;
; so far all bytes are equal. Let´t see if teh string2 is null terminated as it is on string1. Ex: str1) "guga1" | str2) "guga1" Or if it is str1) "guga1" | str2) "gugaM"...
; On the dword, we need to check the byte after the dx like: 00XX_0000
If B$ebx-1 = 0
inc eax
End_If
Note for me, i commented it out, because the above assumption is wrong. In fact, when we reach here we have analysed all the string
and both are exactly the same.
;;
; In mine tests no matter what bytes gets here, because it always means that we fully analysed the strings, and they always matches
inc eax
L3:
EndP
Tested strings:
call FastStringCompare {"1", 0}, {"g", 0}
call FastStringCompare {"g", 0}, {"g", 0}
call FastStringCompare {"gu", 0}, {"gu", 0}
call FastStringCompare {"gug", 0}, {"gug", 0}
call FastStringCompare {"guga", 0}, {"guga", 0}
call FastStringCompare {"guga1", 0}, {"guga1", 0}
call FastStringCompare {"guga12", 0}, {"guga12", 0}
call FastStringCompare {"guga123", 0}, {"guga123", 0}
call FastStringCompare {"guga1234", 0}, {"guga1234", 0}
call FastStringCompare {"guga12345", 0}, {"guga12345", 0}
call FastStringCompare {"guga123456", 0}, {"guga123456", 0}
call FastStringCompare {"1234", 0}, {"5678", 0}
call FastStringCompare {"1", 0}, {"g", 0}
call FastStringCompare {"gustavo trigu", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"g", 0}, {"gu", 0}
call FastStringCompare {"gu", 0}, {"gu", 0}
call FastStringCompare {"gu", 0}, {"guu", 0}
call FastStringCompare {"guuu", 0}, {"guu", 0}
call FastStringCompare {"guu", 0}, {"guuu", 0}
call FastStringCompare {"gug", 0}, {"guga", 0}
call FastStringCompare {"g44564", 0}, {"112345g", 0}
call FastStringCompare {"gu", 0}, {"g", 0}
call FastStringCompare {"gu", 0}, {"gu", 0}
call FastStringCompare {"g", 0}, {"gu", 0}
call FastStringCompare {"g", 0}, {"g", 0}
call FastStringCompare {"1", 0}, {"g", 0}
call FastStringCompare {"g1stavo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gu1tavo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gus1avo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gust1vo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gusta1o trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustav1 trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo1trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo 1rigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilherm1", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilher1e", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilhe1me", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilh1rme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guil1erme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e gui1herme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e gu1lherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e g1ilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e 1uilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e1guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros 1 guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros1e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiro1 e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueir1s e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"guga", 0}, {"guga1", 0}
call FastStringCompare {"guga1", 0}, {"guga", 0}
________________________________________________________________________________________________
________________________________________________________________________________________________
________________________________________________________________________________________________
________________________________________________________________________________________________
________________________________________________________________________________________________
OLDER VERSION 29/01/2013 - Aligned data. It is fast, but the above one is faster.
Proc FastStringCompare:
Arguments @String1, @String2
Uses esi, ecx, ebx, edx
mov esi D@String1
mov ecx 0
mov eax 0
mov ebx D@String2
mov edx D$esi
cmp edx D$ebx ; just to force the zero flags. if we have a match zero flag is settled. But we need to check if the last byte at string1 is zero.
jz L0> ; The 1st 4 bytes are not zero, jmp over
mov ecx 3
shr edx 8 | jz L1> ; If string is only 1 byte len. ecx = 3
inc ecx | jmp L1> ; else, ecx = 4
L0:
cmp dh 0 | jz L1> ; the last byte of the 1st string is zero. We reached the end of the search. Contunue computing the remaining bytes at "L1"
lea ecx D$ecx+4 ; compute the len of the string1
mov edx D$esi+ecx
cmp D$ebx+ecx edx
jz L0<
jmp L3>> ; strings definitelly dont match
L1:
lea ebx D$ebx+ecx | mov edx D$ebx-3
lea esi D$esi+ecx
cmp edx D$esi-3 ; if both words on string1 and string2 don´ match jmp over. Otherwise, continue analysing the remaining bytes
jnz L3>>
; We are almost there. If ecx = 0, means that the string1 have only 2 byte len. So let´t search for the 2nd byte of string2
;;
; so far all bytes are equal. Let´t see if teh string2 is null terminated as it is on string1. Ex: str1) "guga1" | str2) "guga1" Or if it is str1) "guga1" | str2) "gugaM"...
; On the dword, we need to check the byte after the dx like: 00XX_0000
If B$ebx-1 = 0
inc eax
End_If
Note for me, i commented it out, because the above assumption is wrong. In fact, when we reach here we have analysed all the string
and both are exactly the same.
;;
; In mine tests no matter what bytes gets here, because it always means that we fully analysed the strings, and they always matches
inc eax
L3:
EndP
OLDER VERSION 28/01/2013 - Unaligned data and some mistake when the string is only 1 byte len
; Return values:
; Eax = 0. Strings don´t match
; Eax = 1. Strings matches
Proc FastStringCompare:
Arguments @String1, @String2
Uses esi, ecx, ebx, edx
mov esi D@String1
xor ecx ecx
xor eax eax
xor edx edx
mov ebx D@String2
mov ax W$ebx
mov dx W$esi
xor ax dx ; just to force the zero flags. if we have a match zero flag is settled. But we need to check if the last byte at string1 is zero.
L0:
or dh 0 | jz L1> ; the last byte of the 1st string is zero. We reached the end of the search. Contunue computing the remaining bytes at "L1"
lea ecx D$ecx+2 ; compute the len of the string1
mov dx W$esi+ecx
cmp W$ebx+ecx dx
jz L0<
L1:
lea ebx D$ebx+ecx | mov ax W$ebx-1
lea esi D$esi+ecx | mov dx W$esi-1
xor ax dx ; if both words on string1 and string2 don´ match jmp over. Otherwise, continue analysing the remaining bytes
jz L1> | L2: | xor eax eax | ExitP | L1:
; We are almost there. If ecx = 0, means that the string1 have only 2 byte len. So let´t search for the 2nd byte of string2
L1:
; do again for the last bytes
; so far all bytes are equal
mov eax &TRUE
.If ecx = 0 ; when len (ecx) is zero it means that we the string is smaller then 2 bytes (including null terminated)
If B$ebx+1 <> 0 ; so all we need is analyse if the last byte if the string2 is zero (Because the 1st one of both matches)
xor eax eax
End_If
.Else_If B$ebx <> 0
xor eax eax
.End_If
EndP
Code examples
call FastStringCompare {"g", 0}, {"gu", 0}
call FastStringCompare {"gu", 0}, {"gu", 0}
call FastStringCompare {"gu", 0}, {"guu", 0}
call FastStringCompare {"guuu", 0}, {"guu", 0}
call FastStringCompare {"guu", 0}, {"guuu", 0}
call FastStringCompare {"gug", 0}, {"guga", 0}
call FastStringCompare {"g44564", 0}, {"112345g", 0}
call FastStringCompare {"gu", 0}, {"g", 0}
call FastStringCompare {"gu", 0}, {"gu", 0}
call FastStringCompare {"g", 0}, {"gu", 0}
call FastStringCompare {"g", 0}, {"g", 0}
mov ecx 0-1
Do
call FastStringCompare {"gustavo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
dec ecx
Loop_Until ecx = 0
call FastStringCompare {"guga", 0}, {"guga1", 0}
call FastStringCompare {"guga1", 0}, {"guga", 0}
Judging from the function test results, my translation to MASM code is apparently correct.
;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
.data
str1 db "The MASM32 SDK has a large body of example code that deals",0
str2 db "The MASM32 SDK has a large body of example code that deal",0
.code
;==============================================================================
FastStringCompare proc uses esi ecx ebx edx string1:DWORD, string2:DWORD
mov esi, string1
xor ecx, ecx
xor eax, eax
xor edx, edx
mov ebx, string2
mov ax, [ebx]
mov dx, [esi]
xor ax, dx ; just to force the zero flags.
; if we have a match zero flag is settled.
; But we need to check if the last byte at
; string1 is zero.
L0:
or dh, 0
jz L1 ; the last byte of the 1st string is zero.
; We reached the end of the search.
; Contunue computing the remaining bytes at "L1"
lea ecx, [ecx+2] ; compute the len of the string1
mov dx, [esi+ecx]
cmp [ebx+ecx], dx
jz L0
L1:
lea ebx, [ebx+ecx]
mov ax, [ebx-1]
lea esi, [esi+ecx]
mov dx, [esi-1]
xor ax, dx ; if both words on string1 and string2 don't
; match jmp over. Otherwise, continue analysing
;the remaining bytes
jz L2
xor eax, eax
ret
L2: ; We are almost there. If ecx = 0, means that the
; string1 have only 2 byte len. So let´t search
; for the 2nd byte of string2.
; do again for the last bytes
; so far all bytes are equal
mov eax, TRUE
.if ecx == 0 ; when len (ecx) is zero it means that we the
; string is smaller then 2 bytes (including null
; terminated).
.if byte ptr [ebx+1]
; so all we need is analyse if the last byte if
; the string2 is zero (Because the 1st one of both
; matches)
xor eax, eax
.endif
.elseif byte ptr [ebx]
xor eax, eax
.endif
ret
FastStringCompare endp
;==============================================================================
start:
;==============================================================================
invoke FastStringCompare, chr$("g"), chr$("gu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("gu"), chr$("gu")
printf("(should be 1)\t%d\n", eax)
invoke FastStringCompare, chr$("gu"), chr$("guu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("guuu"), chr$("guu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("guu"), chr$("guuu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("gug"), chr$("guga")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("g44564"), chr$("112345g")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("gu"), chr$("g")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("gu"), chr$("gu")
printf("(should be 1)\t%d\n", eax)
invoke FastStringCompare, chr$("g"), chr$("gu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("g"), chr$("g")
printf("(should be 1)\t%d\n\n", eax)
invoke GetCurrentProcess
invoke SetProcessAffinityMask, eax, 1
invoke Sleep, 5000
REPEAT 3
counter_begin 10000000, HIGH_PRIORITY_CLASS
invoke crt_strcmp, ADDR str1, ADDR str1
counter_end
printf("%d cycles, crt_strcmp str1, str1\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
invoke FastStringCompare, ADDR str1, ADDR str1
counter_end
printf("%d cycles, FastStringCompare str1, str1\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
invoke crt_strcmp, ADDR str1, ADDR str2
counter_end
printf("%d cycles, crt_strcmp str1, str2\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
invoke FastStringCompare, ADDR str1, ADDR str2
counter_end
printf("%d cycles, FastStringCompare str1, str2\n\n", eax)
ENDM
inkey
exit
;==============================================================================
end start
Running on a P3:
321 cycles, crt_strcmp str1, str1
274 cycles, FastStringCompare str1, str1
308 cycles, crt_strcmp str1, str2
151 cycles, FastStringCompare str1, str2
320 cycles, crt_strcmp str1, str1
274 cycles, FastStringCompare str1, str1
307 cycles, crt_strcmp str1, str2
151 cycles, FastStringCompare str1, str2
320 cycles, crt_strcmp str1, str1
274 cycles, FastStringCompare str1, str1
308 cycles, crt_strcmp str1, str2
151 cycles, FastStringCompare str1, str2
Running on a P4 Northwood:
199 cycles, crt_strcmp str1, str1
274 cycles, FastStringCompare str1, str1
197 cycles, crt_strcmp str1, str2
205 cycles, FastStringCompare str1, str2
195 cycles, crt_strcmp str1, str1
272 cycles, FastStringCompare str1, str1
194 cycles, crt_strcmp str1, str2
205 cycles, FastStringCompare str1, str2
199 cycles, crt_strcmp str1, str1
272 cycles, FastStringCompare str1, str1
185 cycles, crt_strcmp str1, str2
205 cycles, FastStringCompare str1, str2
I think eliminating the partial-register (and misaligned) accesses should improve the speed significantly.
Hi Guga,
For some times ago I was playing with string compare and tried aligning the strings.
I have no Idea how fast or "slow" :P is this algo just thought to post it for interest.
; STRCMP$(lpString1:STRING,lpString2:STRING),INT
;lpString1 < lpString2 EAX = -n : lpString1 > lpString2 EAX = +n : lpString1 = lpString2 EAX = 0
;Flags are set as well
; =============== S U B R O U T I N E =======================================
align 4
STRCMP$ proc public uses edi esi ebx String1:DWORD, String2:DWORD
mov esi, String1
mov edi, String2
cmp edi, esi
jz Exit_I
xor edx, edx
xor eax, eax
xor ecx, ecx
mov dl, 3
@@: test esi, edx
jz aligned
lodsb
mov cl, [edi]
sub eax, ecx
jnz Exit
jecxz Exit
inc edi
jmp @B
notequ: xchg eax, edx
@@: movzx eax, dl
movzx ecx, bl
sub eax, ecx
jnz Exit
jecxz Exit
shr edx, 8
shr ebx, 8
jmp @B
aligned:mov ecx, 80808080h ;ECX = 0x80808080
mov edx, 1010101h ;EDX = 0x1010101
@@: mov eax, [esi]
mov ebx, [edi]
add esi, 4
cmp eax, ebx
jnz notequ
add edi, 4
sub eax, edx
and eax, ecx
je @B
not ebx
and eax, ebx
je @B
Exit_I: xor eax, eax
Exit: ret 8
STRCMP$ endp
the problem with string comparison is one string may be aligned when the other is not
if you want maximum speed, i would think you would access both strings as aligned dwords
you may wind up writing 4 different loops to do the grunt-work
some testing may be required to find out if the in-register re-alignment
effort is actually faster than un-aligned accesses (for one of the strings)
My first test had a problem in that I failed to DWORD align my two test strings in the data section. Aligning them dropped the FastStringCompare (P3) cycle counts from ~274 and ~150 to ~154 and ~149. Then I determined that these two strings had to be the same length to get the correct comparison result. To fix this I simply made the strings the same length, but there is some problem either in the original code or in my translation. I was able to eliminate a few of the partial-register accesses, but not all of them.
;==============================================================================
include \masm32\include\masm32rt.inc
.686
include \masm32\macros\timers.asm
;==============================================================================
.data
align 4
str1 db "The MASM32 SDK has a large body of example code that deals",0
align 4
str2 db "The MASM32 SDK has a large body of example code that deal ",0
.code
;==============================================================================
FastStringCompare proc uses esi ecx ebx edx string1:DWORD, string2:DWORD
mov esi, string1
xor ecx, ecx
xor eax, eax
xor edx, edx
mov ebx, string2
;; ELIMINATED PARTIAL REGISTER ACCESSES HERE
movzx eax, WORD PTR [ebx]
movzx edx, WORD PTR [esi]
xor eax, edx ; just to force the zero flags.
; if we have a match zero flag is settled.
; But we need to check if the last byte at
; string1 is zero.
L0:
or dh, 0
jz L1 ; the last byte of the 1st string is zero.
; We reached the end of the search.
; Contunue computing the remaining bytes at "L1"
lea ecx, [ecx+2] ; compute the len of the string1
mov dx, [esi+ecx]
cmp [ebx+ecx], dx
jz L0
L1:
lea ebx, [ebx+ecx]
mov ax, [ebx-1]
lea esi, [esi+ecx]
mov dx, [esi-1]
xor ax, dx ; if both words on string1 and string2 don't
; match jmp over. Otherwise, continue analysing
; the remaining bytes
jz L2
xor eax, eax
ret
L2: ; We are almost there. If ecx = 0, means that the
; string1 have only 2 byte len. So let´t search
; for the 2nd byte of string2.
; do again for the last bytes
; so far all bytes are equal
mov eax, TRUE
.if ecx == 0 ; when len (ecx) is zero it means that we the
; string is smaller then 2 bytes (including null
; terminated).
.if byte ptr [ebx+1]
; so all we need is analyse if the last byte if
; the string2 is zero (Because the 1st one of both
; matches)
xor eax, eax
.endif
.elseif byte ptr [ebx]
xor eax, eax
.endif
ret
FastStringCompare endp
;==============================================================================
start:
;==============================================================================
invoke FastStringCompare, chr$("g"), chr$("gu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("gu"), chr$("gu")
printf("(should be 1)\t%d\n", eax)
invoke FastStringCompare, chr$("gu"), chr$("guu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("guuu"), chr$("guu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("guu"), chr$("guuu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("gug"), chr$("guga")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("g44564"), chr$("112345g")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("gu"), chr$("g")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("gu"), chr$("gu")
printf("(should be 1)\t%d\n", eax)
invoke FastStringCompare, chr$("g"), chr$("gu")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("g"), chr$("g")
printf("(should be 1)\t%d\n\n", eax)
invoke FastStringCompare, chr$("gustavo trigueiros e guilherme"),
chr$("gustavo trigueiros e guilherme")
printf("(should be 1)\t%d\n\n", eax)
invoke FastStringCompare, chr$("guga"), chr$("guga1")
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, chr$("guga1"), chr$("guga")
printf("(should be 0)\t%d\n\n", eax)
invoke FastStringCompare, ADDR str1, ADDR str1
printf("(should be 1)\t%d\n", eax)
invoke FastStringCompare, ADDR str2, ADDR str2
printf("(should be 1)\t%d\n", eax)
invoke FastStringCompare, ADDR str1, ADDR str2
printf("(should be 0)\t%d\n", eax)
invoke FastStringCompare, ADDR str2, ADDR str1
printf("(should be 0)\t%d\n\n", eax)
invoke GetCurrentProcess
invoke SetProcessAffinityMask, eax, 1
invoke Sleep, 5000
REPEAT 3
counter_begin 10000000, HIGH_PRIORITY_CLASS
invoke crt_strcmp, ADDR str1, ADDR str1
counter_end
printf("%d cycles, crt_strcmp str1, str1\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
invoke FastStringCompare, ADDR str1, ADDR str1
counter_end
printf("%d cycles, FastStringCompare str1, str1\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
invoke crt_strcmp, ADDR str1, ADDR str2
counter_end
printf("%d cycles, crt_strcmp str1, str2\n", eax)
counter_begin 10000000, HIGH_PRIORITY_CLASS
invoke FastStringCompare, ADDR str1, ADDR str2
counter_end
printf("%d cycles, FastStringCompare str1, str2\n\n", eax)
ENDM
inkey
exit
;==============================================================================
end start
P3:
320 cycles, crt_strcmp str1, str1
129 cycles, FastStringCompare str1, str1
307 cycles, crt_strcmp str1, str2
124 cycles, FastStringCompare str1, str2
320 cycles, crt_strcmp str1, str1
129 cycles, FastStringCompare str1, str1
307 cycles, crt_strcmp str1, str2
123 cycles, FastStringCompare str1, str2
320 cycles, crt_strcmp str1, str1
129 cycles, FastStringCompare str1, str1
307 cycles, crt_strcmp str1, str2
124 cycles, FastStringCompare str1, str2
P4 Northwood:
193 cycles, crt_strcmp str1, str1
151 cycles, FastStringCompare str1, str1
194 cycles, crt_strcmp str1, str2
147 cycles, FastStringCompare str1, str2
196 cycles, crt_strcmp str1, str1
152 cycles, FastStringCompare str1, str1
196 cycles, crt_strcmp str1, str2
152 cycles, FastStringCompare str1, str2
196 cycles, crt_strcmp str1, str1
151 cycles, FastStringCompare str1, str1
196 cycles, crt_strcmp str1, str2
149 cycles, FastStringCompare str1, str2
FastStringCompare looks competitive, but is the return value correct?
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
loop overhead is approx. 238/100 cycles
6929 cycles for 100 * crt_strcmp
7926 cycles for 100 * FastStringCompare
5713 cycles for 100 * MasmBasic StringsDiffer
6034 cycles for 100 * STRCMP$
6936 cycles for 100 * crt_strcmp
7925 cycles for 100 * FastStringCompare
5714 cycles for 100 * MasmBasic StringsDiffer
6033 cycles for 100 * STRCMP$
6930 cycles for 100 * crt_strcmp
7920 cycles for 100 * FastStringCompare
5720 cycles for 100 * MasmBasic StringsDiffer
6033 cycles for 100 * STRCMP$
6928 cycles for 100 * crt_strcmp
7924 cycles for 100 * FastStringCompare
5716 cycles for 100 * MasmBasic StringsDiffer
6034 cycles for 100 * STRCMP$
6933 cycles for 100 * crt_strcmp
7985 cycles for 100 * FastStringCompare
5717 cycles for 100 * MasmBasic StringsDiffer
6048 cycles for 100 * STRCMP$
-1 = eax crt_strcmp
0 = eax FastStringCompare
-23 = eax MasmBasic StringsDiffer
-23 = eax STRCMP$
Mine above rocks despite using some slow OP codes like "jecxz". :bgrin:
Quote
130 cycles, crt_strcmp str1, str1
116 cycles, FastStringCompare str1, str1
1 cycles, STRCMP$ str1, str1
128 cycles, crt_strcmp str1, str2
111 cycles, FastStringCompare str1, str2
51 cycles, STRCMP$ str1, str2
130 cycles, crt_strcmp str1, str1
116 cycles, FastStringCompare str1, str1
1 cycles, STRCMP$ str1, str1
127 cycles, crt_strcmp str1, str2
111 cycles, FastStringCompare str1, str2
51 cycles, STRCMP$ str1, str2
130 cycles, crt_strcmp str1, str1
116 cycles, FastStringCompare str1, str1
1 cycles, STRCMP$ str1, str1
127 cycles, crt_strcmp str1, str2
111 cycles, FastStringCompare str1, str2
50 cycles, STRCMP$ str1, str2
The CRT strcmp function returns 0 if the strings match, < 0 if string1 < string2, or > 0 if string1 is > string2, allowing it to be used directly as a comparison function for qsort and bsearch.
Quote from: Ficko on January 29, 2013, 01:46:36 AM
Mine above rocks despite using some slow OP codes like "jecxz". :bgrin:
It's pretty good for a non-SSE2 algo - I added it above :t
@Michael: I meant FastStringCompare, it returns zero for some reason - see above.
Quote from: MichaelW on January 29, 2013, 02:02:34 AM
The CRT strcmp function returns 0 if the strings match, < 0 if string1 < string2, or > 0 if string1 is > string2, allowing it to be used directly as a comparison function for qsort and bsearch.
STRCMP$ mimics this behavior as well - additionaly it set's the flags too so "eax" do not have to be checked. -
I designed it for my QSORT algo indeed. ;)
FastStringCompare is returning false because the strings do not match.
Quote from: MichaelW on January 29, 2013, 02:27:52 AM
FastStringCompare is returning false because the strings do not match.
Make them match and see what it returns.
Quote from: jj2007 on January 29, 2013, 02:49:46 AM
Quote from: MichaelW on January 29, 2013, 02:27:52 AM
FastStringCompare is returning false because the strings do not match.
Make them match and see what it returns.
In my tests the return value is correct only if the strings are aligned (I used a 4-byte alignment and did not investigate other alignments). I intended to add this information to my second post, but forgot to do so.
Strange - I get this:
align 4 ; not equal
Str1 db "gustavo trigueiros e guilhermeA", 0
align 4
Str2 db "gustavo trigueiros e guilhermeX", 0
6930 cycles for 100 * crt_strcmp
8307 cycles for 100 * FastStringCompare
4738 cycles for 100 * MasmBasic StringsDiffer
6028 cycles for 100 * STRCMP$
-1 = eax crt_strcmp
0 = eax FastStringCompare
-23 = eax MasmBasic StringsDiffer
-23 = eax STRCMP$
7910 cycles for 100 * crt_strcmp
8122 cycles for 100 * FastStringCompare
4741 cycles for 100 * MasmBasic StringsDiffer
3921 cycles for 100 * STRCMP$
align 4 ; EQUAL
Str1 db "gustavo trigueiros e guilhermeX", 0
align 4
Str2 db "gustavo trigueiros e guilhermeX", 0
0 = eax crt_strcmp
0 = eax FastStringCompare
0 = eax MasmBasic StringsDiffer
0 = eax STRCMP$
Hi guys, many tks for testing :)
Note: The return value for this function is only
&TRUE or
&FALSE
- If the strings matched it returns &TRUE
- If it fails (no matter the order of the previously good match) it always returns &FALSE
I didn´t coded it to return 0-1 (The cases that we have a partial match) because i wasn´t thinking on the qsort and bsearch functions. (I´ll see if i can implement this return, if it don´t slow down the resultant timmings)
About the alignment of the strings, indeed, i agree, it would (in theory) speed up the scanning process in 2x, approximately , but, how compilers align their strings (in 2 byte or 4) ? This is because i thought in 2 byte alignment/scanning because the minimum null terminated string to be compared is always a char+zero (Ex: "a", 0). But, what happens when we are dealing with a string inside a structure for example ? Or inside another chain of bytes (Like an array, etc), or even if this strings is in coded in the .text section in the middle of a function (Like in 3dframes.exe from masm package, for example) ?
Or even worst, what if the string ends on the end of a memory chunck ? I mean, if we have a string the looks like this
.data section --->
Guga db "Hello", 0
Virtual Memory: B$ ? ? ?
(...)
.idata section.
If the function starts comparing in a 4 byte aligned when it reach the 5th byte "o, 0", wouldn´t cause the app to crash ? (Sorry for the dumb question, but i´m not sure if
all compilers are 4 byte aligned)
Have anyone knows if all compilers align their strings (or others data) always in a 4 byte boundary ?
If it is always aligned on a 4 byte, i can try to implement a alignment strings for the comparison and see the results.
Quote from: jj2007 on January 29, 2013, 03:19:29 AM
Strange - I get this...
Yes, if I substitute your strings in my test app, then both of these comparisons return zero:
invoke FastStringCompare, ADDR str1, ADDR str1
invoke FastStringCompare, ADDR str2, ADDR str2
But if I add a space to the end of both strings, then both comparisons return 1, so the problem appears to be with the length of the strings.
And if I then comment out the align directives, both comparisons return zero. So the code is sensitive to the length and the alignment.
The above applies for my P3 and my P4.
Quote from: guga on January 29, 2013, 09:19:10 AM
Have anyone knows if all compilers align their strings (or others data) always in a 4 byte boundary?
I would guess that mainstream compilers do. The Microsoft compilers appear to maintain a minimum 4-byte alignment for any string longer than the null terminator, but I'm not sure that my test data layout is correct for what I'm trying to determine.
#include <conio.h>
#include <stdio.h>
#include <stdlib.h>
int alignment( void *address )
{
int rval;
__asm
{
xor eax, eax
mov ecx, address
bsf ecx, ecx
jz done
mov eax, 1
Shl eax, cl
done:
mov rval, eax
}
return rval;
}
char x6[6],x5[5],x4[4],x3[3],x2[2],x1[1],y1[1],y2[2],y3[3],y4[4],y5[5],y6[6];
void main()
{
char i6[6],i5[5],i4[4],i3[3],i2[2],i1[1],j1[1],j2[2],j3[3],j4[4],j5[5],j6[6];
printf("%d\n",alignment(i6));
printf("%d\n",alignment(i5));
printf("%d\n",alignment(i4));
printf("%d\n",alignment(i3));
printf("%d\n",alignment(i2));
printf("%d\n",alignment(i1));
printf("%d\n",alignment(j1));
printf("%d\n",alignment(j2));
printf("%d\n",alignment(j3));
printf("%d\n",alignment(j4));
printf("%d\n",alignment(j5));
printf("%d\n\n",alignment(j6));
printf("%d\n",alignment(x6));
printf("%d\n",alignment(x5));
printf("%d\n",alignment(x4));
printf("%d\n",alignment(x3));
printf("%d\n",alignment(x2));
printf("%d\n",alignment(x1));
printf("%d\n",alignment(y1));
printf("%d\n",alignment(y2));
printf("%d\n",alignment(y3));
printf("%d\n",alignment(y4));
printf("%d\n",alignment(y5));
printf("%d\n\n",alignment(y6));
getch();
}
8
64
16
8
4
1
1
4
8
32
4
8
16
32
4
128
4
1
2
8
4
8
8
8
; Listing generated by Microsoft (R) Optimizing Compiler Version 13.10.3077
TITLE test.c
.386P
include listing.inc
if @Version gt 510
.model FLAT
else
_TEXT SEGMENT PARA USE32 PUBLIC 'CODE'
_TEXT ENDS
_DATA SEGMENT DWORD USE32 PUBLIC 'DATA'
_DATA ENDS
CONST SEGMENT DWORD USE32 PUBLIC 'CONST'
CONST ENDS
_BSS SEGMENT DWORD USE32 PUBLIC 'BSS'
_BSS ENDS
$$SYMBOLS SEGMENT BYTE USE32 'DEBSYM'
$$SYMBOLS ENDS
_TLS SEGMENT DWORD USE32 PUBLIC 'TLS'
_TLS ENDS
FLAT GROUP _DATA, CONST, _BSS
ASSUME CS: FLAT, DS: FLAT, SS: FLAT
endif
INCLUDELIB LIBC
INCLUDELIB OLDNAMES
_DATA SEGMENT
COMM _x6:BYTE:06H
COMM _x5:BYTE:05H
COMM _x4:BYTE:04H
COMM _x3:BYTE:03H
COMM _x2:BYTE:02H
COMM _x1:BYTE
COMM _y1:BYTE
COMM _y2:BYTE:02H
COMM _y3:BYTE:03H
COMM _y4:BYTE:04H
COMM _y5:BYTE:05H
COMM _y6:BYTE:06H
$SG1250 DB '%d', 0aH, 00H
$SG1251 DB '%d', 0aH, 00H
$SG1252 DB '%d', 0aH, 00H
$SG1253 DB '%d', 0aH, 00H
$SG1254 DB '%d', 0aH, 00H
$SG1255 DB '%d', 0aH, 00H
$SG1256 DB '%d', 0aH, 00H
$SG1257 DB '%d', 0aH, 00H
$SG1258 DB '%d', 0aH, 00H
$SG1259 DB '%d', 0aH, 00H
$SG1260 DB '%d', 0aH, 00H
$SG1261 DB '%d', 0aH, 0aH, 00H
ORG $+3
$SG1262 DB '%d', 0aH, 00H
$SG1263 DB '%d', 0aH, 00H
$SG1264 DB '%d', 0aH, 00H
$SG1265 DB '%d', 0aH, 00H
$SG1266 DB '%d', 0aH, 00H
$SG1267 DB '%d', 0aH, 00H
$SG1268 DB '%d', 0aH, 00H
$SG1269 DB '%d', 0aH, 00H
$SG1270 DB '%d', 0aH, 00H
$SG1271 DB '%d', 0aH, 00H
$SG1272 DB '%d', 0aH, 00H
$SG1273 DB '%d', 0aH, 0aH, 00H
_DATA ENDS
PUBLIC _alignment
; Function compile flags: /Odt
_TEXT SEGMENT
_rval$ = -4 ; size = 4
_address$ = 8 ; size = 4
_alignment PROC NEAR
; File c:\program files\microsoft visual c++ toolkit 2003\my\alignment\test.c
; Line 6
push ebp
mov ebp, esp
push ecx
; Line 10
xor eax, eax
; Line 11
mov ecx, DWORD PTR _address$[ebp]
; Line 12
bsf ecx, ecx
; Line 13
je SHORT $done$1223
; Line 14
mov eax, 1
; Line 15
shl eax, cl
$done$1223:
; Line 17
mov DWORD PTR _rval$[ebp], eax
; Line 19
mov eax, DWORD PTR _rval$[ebp]
; Line 20
mov esp, ebp
pop ebp
ret 0
_alignment ENDP
_TEXT ENDS
PUBLIC _main
EXTRN _getch:NEAR
EXTRN _printf:NEAR
; Function compile flags: /Odt
_TEXT SEGMENT
_i2$ = -64 ; size = 2
_j6$ = -60 ; size = 6
_i4$ = -52 ; size = 4
_j1$ = -45 ; size = 1
_i3$ = -44 ; size = 3
_j2$ = -40 ; size = 2
_i5$ = -36 ; size = 5
_j3$ = -28 ; size = 3
_j5$ = -24 ; size = 5
_i1$ = -13 ; size = 1
_i6$ = -12 ; size = 6
_j4$ = -4 ; size = 4
_main PROC NEAR
; Line 25
push ebp
mov ebp, esp
sub esp, 64 ; 00000040H
; Line 28
lea eax, DWORD PTR _i6$[ebp]
push eax
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1250
call _printf
add esp, 8
; Line 29
lea ecx, DWORD PTR _i5$[ebp]
push ecx
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1251
call _printf
add esp, 8
; Line 30
lea edx, DWORD PTR _i4$[ebp]
push edx
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1252
call _printf
add esp, 8
; Line 31
lea eax, DWORD PTR _i3$[ebp]
push eax
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1253
call _printf
add esp, 8
; Line 32
lea ecx, DWORD PTR _i2$[ebp]
push ecx
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1254
call _printf
add esp, 8
; Line 33
lea edx, DWORD PTR _i1$[ebp]
push edx
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1255
call _printf
add esp, 8
; Line 34
lea eax, DWORD PTR _j1$[ebp]
push eax
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1256
call _printf
add esp, 8
; Line 35
lea ecx, DWORD PTR _j2$[ebp]
push ecx
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1257
call _printf
add esp, 8
; Line 36
lea edx, DWORD PTR _j3$[ebp]
push edx
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1258
call _printf
add esp, 8
; Line 37
lea eax, DWORD PTR _j4$[ebp]
push eax
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1259
call _printf
add esp, 8
; Line 38
lea ecx, DWORD PTR _j5$[ebp]
push ecx
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1260
call _printf
add esp, 8
; Line 39
lea edx, DWORD PTR _j6$[ebp]
push edx
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1261
call _printf
add esp, 8
; Line 42
push OFFSET FLAT:_x6
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1262
call _printf
add esp, 8
; Line 43
push OFFSET FLAT:_x5
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1263
call _printf
add esp, 8
; Line 44
push OFFSET FLAT:_x4
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1264
call _printf
add esp, 8
; Line 45
push OFFSET FLAT:_x3
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1265
call _printf
add esp, 8
; Line 46
push OFFSET FLAT:_x2
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1266
call _printf
add esp, 8
; Line 47
push OFFSET FLAT:_x1
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1267
call _printf
add esp, 8
; Line 48
push OFFSET FLAT:_y1
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1268
call _printf
add esp, 8
; Line 49
push OFFSET FLAT:_y2
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1269
call _printf
add esp, 8
; Line 50
push OFFSET FLAT:_y3
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1270
call _printf
add esp, 8
; Line 51
push OFFSET FLAT:_y4
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1271
call _printf
add esp, 8
; Line 52
push OFFSET FLAT:_y5
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1272
call _printf
add esp, 8
; Line 53
push OFFSET FLAT:_y6
call _alignment
add esp, 4
push eax
push OFFSET FLAT:$SG1273
call _printf
add esp, 8
; Line 55
call _getch
; Line 56
xor eax, eax
mov esp, ebp
pop ebp
ret 0
_main ENDP
_TEXT ENDS
END
Edit:
Analyzing this further, and noting in the listing that the compiler had changed the order of the local strings in main, but not the order of the COMM strings in the DATA section, I added this code:
printf("%d\n",&x6);
printf("%d\n",&x5);
printf("%d\n",&x4);
printf("%d\n",&x3);
printf("%d\n",&x2);
printf("%d\n",&x1);
printf("%d\n",&y1);
printf("%d\n",&y2);
printf("%d\n",&y3);
printf("%d\n",&y4);
printf("%d\n",&y5);
printf("%d\n\n",&y6);
And got this output:
4233648
4233632
4233628
4233664
4233668
4233677
4233662
4233624
4233644
4233640
4233672
4233656
While I cannot readily equate the layout of the addresses to the compiler's layout of the local strings, the data has obviously been shuffled. So the question is, who did the shuffling? And the answer from the MASM 6.0 Programmer's Guide, under Selecting Data-Sharing Methods, is:
Quote
As an alternative, you can use the COMM directive instead of PUBLIC and EXTERN. However, communal variables have some limitations. You cannot depend on their location in memory because they are allocated by the linker, and they cannot be initialized.
Ok, guys, i guess i finished with string alignment. Didn´t tested the performance. So feel free to test it.
On the strings i tested the result is accurate. (0 if don´t match, 1 if match) - If this new version is fast i´ll implement the return 0-1 on it.
Proc FastStringCompare:
Arguments @String1, @String2
Uses esi, ecx, ebx, edx
mov esi D@String1
mov ecx 0
mov eax 0
mov ebx D@String2
mov edx D$esi
cmp edx D$ebx ; just to force the zero flags. if we have a match zero flag is settled. But we need to check if the last byte at string1 is zero.
jz L0> ; The 1st 4 bytes are not zero, jmp over
mov ecx 3
shr edx 8 | jz L1> ; If string is only 1 byte len. ecx = 3
inc ecx | jmp L1> ; else, ecx = 4
L0:
cmp dh 0 | jz L1> ; the last byte of the 1st string is zero. We reached the end of the search. Contunue computing the remaining bytes at "L1"
lea ecx D$ecx+4 ; compute the len of the string1
mov edx D$esi+ecx
cmp D$ebx+ecx edx
jz L0<
jmp L3>> ; strings definitelly dont match
L1:
lea ebx D$ebx+ecx | mov edx D$ebx-3
lea esi D$esi+ecx
cmp edx D$esi-3 ; if both words on string1 and string2 don´ match jmp over. Otherwise, continue analysing the remaining bytes
jnz L3>>
; We are almost there. If ecx = 0, means that the string1 have only 2 byte len. So let´t search for the 2nd byte of string2
;;
; so far all bytes are equal. Let´t see if teh string2 is null terminated as it is on string1. Ex: str1) "guga1" | str2) "guga1" Or if it is str1) "guga1" | str2) "gugaM"...
; On the dword, we need to check the byte after the dx like: 00XX_0000
If B$ebx-1 = 0
inc eax
End_If
Note for me, i commented it out, because the above assumption is wrong. In fact, when we reach here we have analysed all the string
and both are exactly the same.
;;
; In mine tests no matter what bytes gets here, because it always means that we fully analysed the strings, and they always matches
inc eax
L3:
EndP
Tested strings:
call FastStringCompare {"1", 0}, {"g", 0}
call FastStringCompare {"g", 0}, {"g", 0}
call FastStringCompare {"gu", 0}, {"gu", 0}
call FastStringCompare {"gug", 0}, {"gug", 0}
call FastStringCompare {"guga", 0}, {"guga", 0}
call FastStringCompare {"guga1", 0}, {"guga1", 0}
call FastStringCompare {"guga12", 0}, {"guga12", 0}
call FastStringCompare {"guga123", 0}, {"guga123", 0}
call FastStringCompare {"guga1234", 0}, {"guga1234", 0}
call FastStringCompare {"guga12345", 0}, {"guga12345", 0}
call FastStringCompare {"guga123456", 0}, {"guga123456", 0}
call FastStringCompare {"1234", 0}, {"5678", 0}
call FastStringCompare {"1", 0}, {"g", 0}
call FastStringCompare {"gustavo trigu", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"g", 0}, {"gu", 0}
call FastStringCompare {"gu", 0}, {"gu", 0}
call FastStringCompare {"gu", 0}, {"guu", 0}
call FastStringCompare {"guuu", 0}, {"guu", 0}
call FastStringCompare {"guu", 0}, {"guuu", 0}
call FastStringCompare {"gug", 0}, {"guga", 0}
call FastStringCompare {"g44564", 0}, {"112345g", 0}
call FastStringCompare {"gu", 0}, {"g", 0}
call FastStringCompare {"gu", 0}, {"gu", 0}
call FastStringCompare {"g", 0}, {"gu", 0}
call FastStringCompare {"g", 0}, {"g", 0}
call FastStringCompare {"1", 0}, {"g", 0}
call FastStringCompare {"g1stavo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gu1tavo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gus1avo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gust1vo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gusta1o trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustav1 trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo1trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo 1rigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilherm1", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilher1e", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilhe1me", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilh1rme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guil1erme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e gui1herme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e gu1lherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e g1ilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e 1uilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e1guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros 1 guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros1e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiro1 e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueir1s e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"gustavo trigueiros e guilherme", 0}, {"gustavo trigueiros e guilherme", 0}
call FastStringCompare {"guga", 0}, {"guga1", 0}
call FastStringCompare {"guga1", 0}, {"guga", 0}
The preliminary results i´ve got is
Not sure if the result is accurated because i wasn´t be able to assemble the masm app from JJ2007, so i just disassembled it, inserted my code and reassembled it back. But comparing the older results i guess it is faster then my previous version.
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SE4)
loop overhead is approx. 189/100 cycles
7185 cycles for 100 * crt_strcmp
2060 cycles for 100 * FastStringCompare
2058 cycles for 100 * MasmBasic StringsDiffer
3325 cycles for 100 * STRCMP$
7211 cycles for 100 * crt_strcmp
2057 cycles for 100 * FastStringCompare
2059 cycles for 100 * MasmBasic StringsDiffer
3216 cycles for 100 * STRCMP$
7177 cycles for 100 * crt_strcmp
2058 cycles for 100 * FastStringCompare
2062 cycles for 100 * MasmBasic StringsDiffer
3219 cycles for 100 * STRCMP$
7178 cycles for 100 * crt_strcmp
2057 cycles for 100 * FastStringCompare
2059 cycles for 100 * MasmBasic StringsDiffer
3309 cycles for 100 * STRCMP$
7179 cycles for 100 * crt_strcmp
2208 cycles for 100 * FastStringCompare
2073 cycles for 100 * MasmBasic StringsDiffer
3222 cycles for 100 * STRCMP$
-1 = eax crt_strcmp
0 = eax FastStringCompare
-23 = eax MasmBasic StringsDiffer
-23 = eax STRCMP$
--- ok ---
Quote from: guga on January 29, 2013, 09:19:10 AMDoes anyone know if all compilers align their strings (or other data) always on a 4 byte boundary ?
Freshly created strings are likely to be 8 byte aligned, due to HeapAlloc. But a library function must be able to handle any alignment, e.g. when you pass a pointer to a substring. There shouldn't be any problem, though, unless you use SSE2. The MasmBasic StringsDiffer() does indeed use SSE2 and has some overhead, which makes it relatively slow for very short strings.
Here are two examples for a somewhat longer string, showing the influence of alignment:
Str1 db "gustavo trigueiros e guilherme alias 'guga' is today the brain behind RosAsm, one of the most powerful assemblers ever developedA", 0
Str2 db "gustavo ...X", 0align 4 for both strings:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
++++++++++++++++++++
2894 cycles for 10 * crt_strcmp
2315 cycles for 10 * FastStringCompare
1186 cycles for 10 * MasmBasic StringsDiffer
1964 cycles for 10 * STRCMP$align 16 for both strings:
AMD Athlon(tm) Dual Core Processor 4450B (SSE3)
++++++++++++++++++++
2892 cycles for 10 * crt_strcmp
2309 cycles for 10 * FastStringCompare
969 cycles for 10 * MasmBasic StringsDiffer
1954 cycles for 10 * STRCMP$
Hi JJ, thanks for the nice words :)
I built a new version (using alignment) and just made a new improve:
Proc FastStringCompare:
Arguments @String1, @String2
Uses esi, ecx, ebx, edx
mov esi D@String1
mov ecx 0
mov eax 0
mov ebx D@String2
mov edx D$esi
cmp edx D$ebx ; just to force the zero flags. if we have a match zero flag is settled. But we need to check if the last byte at string1 is zero.
jz L0> ; The 1st 4 bytes are not zero, jmp over
mov ecx 3
shr edx 8 | jz L1> ; If string is only 1 byte len. ecx = 3
inc ecx | jmp L1> ; else, ecx = 4
L0:
test dh dh | jz L1> ; the last byte of the 1st string is zero. We reached the end of the search. Contunue computing the remaining bytes at "L1"
lea ecx D$ecx+4 ; compute the len of the string1
mov edx D$esi+ecx
cmp D$ebx+ecx edx
jz L0<
jmp L3>> ; strings definitelly dont match
L1:
lea ebx D$ebx+ecx | mov edx D$ebx-3
lea esi D$esi+ecx
cmp D$esi-3 edx; if both words on string1 and string2 don´ match jmp over. Otherwise, continue analysing the remaining bytes
jnz L3>>
; We are almost there. If ecx = 0, means that the string1 have only 2 byte len. So let´t search for the 2nd byte of string2
;;
; so far all bytes are equal. Let´t see if teh string2 is null terminated as it is on string1. Ex: str1) "guga1" | str2) "guga1" Or if it is str1) "guga1" | str2) "gugaM"...
; On the dword, we need to check the byte after the dx like: 00XX_0000
If B$ebx-1 = 0
inc eax
End_If
Note for me, i commented it out, because the above assumption is wrong. In fact, when we reach here we have analysed all the string
and both are exactly the same.
;;
; In mine tests no matter what bytes gets here, because it always means that we fully analysed the strings, and they always matches
inc eax
L3:
EndP
Using the same strings as the post above i got this results:
Str1 db "gustavo trigueiros e guilherme alias 'guga' is today the brain behind RosAsm, one of the most powerful assemblers ever developedA", 0
Str2 db "gustavo ...X", 0
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (£E4)
loop overhead is approx. 181/100 cycles
2103 cycles for 100 * crt_strcmp
1087 cycles for 100 * FastStringCompare
1774 cycles for 100 * MasmBasic StringsDiffer
1599 cycles for 100 * STRCMP$
2100 cycles for 100 * crt_strcmp
1091 cycles for 100 * FastStringCompare
1768 cycles for 100 * MasmBasic StringsDiffer
1593 cycles for 100 * STRCMP$
2479 cycles for 100 * crt_strcmp
1559 cycles for 100 * FastStringCompare
1849 cycles for 100 * MasmBasic StringsDiffer
1698 cycles for 100 * STRCMP$
2182 cycles for 100 * crt_strcmp
1087 cycles for 100 * FastStringCompare
1781 cycles for 100 * MasmBasic StringsDiffer
1607 cycles for 100 * STRCMP$
2109 cycles for 100 * crt_strcmp
1090 cycles for 100 * FastStringCompare
1774 cycles for 100 * MasmBasic StringsDiffer
1592 cycles for 100 * STRCMP$
1 = eax crt_strcmp
0 = eax FastStringCompare
70 = eax MasmBasic StringsDiffer
70 = eax STRCMP$
--- ok ---
And this results using
Str1 db "gustavo trigueiros e guilhermeA", 0
Str2 db "gustavo trigueiros e guilhermeX", 0
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SE4)
loop overhead is approx. 181/100 cycles
7380 cycles for 100 * crt_strcmp
2048 cycles for 100 * FastStringCompare
2142 cycles for 100 * MasmBasic StringsDiffer
3322 cycles for 100 * STRCMP$
7889 cycles for 100 * crt_strcmp
2052 cycles for 100 * FastStringCompare
2136 cycles for 100 * MasmBasic StringsDiffer
3326 cycles for 100 * STRCMP$
7370 cycles for 100 * crt_strcmp
2361 cycles for 100 * FastStringCompare
2128 cycles for 100 * MasmBasic StringsDiffer
3334 cycles for 100 * STRCMP$
7378 cycles for 100 * crt_strcmp
2052 cycles for 100 * FastStringCompare
2132 cycles for 100 * MasmBasic StringsDiffer
3708 cycles for 100 * STRCMP$
7392 cycles for 100 * crt_strcmp
2061 cycles for 100 * FastStringCompare
2106 cycles for 100 * MasmBasic StringsDiffer
3306 cycles for 100 * STRCMP$
-1 = eax crt_strcmp
0 = eax FastStringCompare
-23 = eax MasmBasic StringsDiffer
-23 = eax STRCMP$
--- ok ---
I´m not sure if it can be even faster, but i´ll give it a try. The important (at least for me) is not only speed, but accuracy in the results.
Quote
Freshly created strings are likely to be 8 byte aligned, due to HeapAlloc. But a library function must be able to handle any alignment, e.g. when you pass a pointer to a substring.
That's correct. Handling unaligned data is very important.
From Guga:
Quote
Or even worst, what if the string ends on the end of a memory chunck ?
That's right though very unlikely it can happen.
If you request 1 byte trough "HeapAlloc" you will get at least 4 bytes back - if not 8 - but it is not documented by MS as such.
So what I am doing is to wrap the "HeapAlloc" API and add 3 bytes to any heap request if using 32-bit memory access or add 16 bytes if using SSE.
It is a good practice if you use SSE a lot.
Like some of us do.
I use almost exclusivelly SSE for strlen and strcpy because they are so much faster.
Like Dave sad above somewhere it is not allways possible to have both string compared as aligned.
What I am doing it to access at least one string as aligned and the other one by chance. So I get at least 50% alignment on both strings - worst case scenario -
And BTW the above test would be more accurate if it would generate a table of random lenght and aligned strings and use that as input. IMO
yah - you can't depend on the strings having any particular alignment
probably the best approach is to test all the alignment combinations in each pass of the timing test
then divide the result by the number of tests
that way, you optimize your code for overall performance
ALIGN 4
szTest0 db 'String',0
szTestA db 'Strinx',0
ALIGN 4
db 0
szTest1 db 'String',0
szTestB db 'Strinx',0
ALIGN 4
db 0,0
szTest2 db 'String',0
szTestC db 'Strinx',0
ALIGN 4
db 0,0,0
szTest3 db 'String',0
szTestD db 'Strinx',0
compilers may try to align strings
but, they often wind up being partials of some previous string
or in a buffer that is not aligned, etc
or - perhaps you are comparing strings, starting at some unaligned offset
strings are byte entities in a dword world :P
because aligned SSE is probably the fastest way, it's a 16-byte world
On the updated version, the alignment really dont seems to be any problem. I tried to updated it to handle both aligned and unaligned strings, without having to use test, cmp mnemonics for each different alignment.
Also, i just did a small improve on the code and aligned the code itself (lea and some cmp mnemonics) and the result is really impressive. It seems to be faster then JJ2007 SSE version.
Proc FastStringCompare:
Arguments @String1, @String2
Uses esi, ecx, ebx, edx
mov esi D@String1
mov ebx D@String2
mov eax 0
mov ecx 4
mov edx D$esi
cmp edx D$ebx ; just to force the zero flags. if we have a match zero flag is settled. But we need to check if the last byte at string1 is zero.
jz L0> ; The 1st 4 bytes are not zero, jmp over
mov ecx 3
shr edx 8 | jz L1> ; If string is only 1 byte len. ecx = 3
inc ecx | jmp L1> ; else, ecx = 4
L0:
test dh dh | jz L1> ; the last byte of the 1st string is zero. We reached the end of the search. Contunue computing the remaining bytes at "L1"
lea ecx D$ecx+4 ; compute the len of the string1
Align 4 ; trying to align the code to avoid pipeline/stall problems, and get maximum results. In this case, it don´t insert any nop operation. This line can be commented out
mov edx D$esi+ecx
cmp D$ebx+ecx edx
jz L0<
jmp L3>> ; strings definitelly dont match
L1:
lea ebx D$ebx+ecx
Align 4 ; trying to align the code to avoid pipeline/stall problems, and get maximum results. In this case, it don´t insert any nop operation. This line can be commented out
mov edx D$ebx-3
lea esi D$esi+ecx
Align 4 ; trying to align the code to avoid pipeline/stall problems, and get maximum results. In this case, it insert 2 nops operations
cmp D$esi-3 edx; if both words on string1 and string2 don´ match jmp over. Otherwise, continue analysing the remaining bytes
Align 4; trying to align the code to avoid pipeline/stall problems, and get maximum results. In this case, it insert 1 nop operations
jnz L3>>
; In mine tests no matter what bytes gets here, because it always means that we fully analysed the strings, and they always matches
inc eax
L3:
EndP
Results
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SE4)
loop overhead is approx. 189/100 cycles
7327 cycles for 100 * crt_strcmp
1885 cycles for 100 * FastStringCompare
2461 cycles for 100 * MasmBasic StringsDiffer
3291 cycles for 100 * STRCMP$
8410 cycles for 100 * crt_strcmp
1903 cycles for 100 * FastStringCompare
2127 cycles for 100 * MasmBasic StringsDiffer
3279 cycles for 100 * STRCMP$
7286 cycles for 100 * crt_strcmp
1847 cycles for 100 * FastStringCompare
2115 cycles for 100 * MasmBasic StringsDiffer
3278 cycles for 100 * STRCMP$
7304 cycles for 100 * crt_strcmp
1847 cycles for 100 * FastStringCompare
2094 cycles for 100 * MasmBasic StringsDiffer
3302 cycles for 100 * STRCMP$
At night i´ll try to see if this can be improved a little bit more.
Interesting, guga. Can you post the exe (and source), so that we can test it?
Hi JJ
OK, i´ll post in a couple of minutes (or 1 or 2 hours) before i finish my testings for speed and reliability. So far i fixed a small mistake i made that was generating the previous report. On unaligned strings it was missing 1 single byte during the check causing the report to decrease a few ms, and the code miss 1 byte unaligned strings.
Now, it seems to be ok, i modified the code and it seems to handle all strings (aligned and unaligned). So far, the result is:
Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz (SE4)
loop overhead is approx. 208/100 cycles
7589 cycles for 100 * crt_strcmp
2208 cycles for 100 * FastStringCompare
2539 cycles for 100 * MasmBasic StringsDiffer
2950 cycles for 100 * STRCMP$
7593 cycles for 100 * crt_strcmp
2223 cycles for 100 * FastStringCompare
2544 cycles for 100 * MasmBasic StringsDiffer
2946 cycles for 100 * STRCMP$
7578 cycles for 100 * crt_strcmp
2206 cycles for 100 * FastStringCompare
2545 cycles for 100 * MasmBasic StringsDiffer
2947 cycles for 100 * STRCMP$
7575 cycles for 100 * crt_strcmp
2209 cycles for 100 * FastStringCompare
2545 cycles for 100 * MasmBasic StringsDiffer
2947 cycles for 100 * STRCMP$
7658 cycles for 100 * crt_strcmp
2214 cycles for 100 * FastStringCompare
2544 cycles for 100 * MasmBasic StringsDiffer
2947 cycles for 100 * STRCMP$
0 = eax crt_strcmp
1 = eax FastStringCompare
0 = eax MasmBasic StringsDiffer
0 = eax STRCMP$
--- ok ---
And on my tests it didn´t missed unaligned bytes anylonger. But, i´ll make a couple of more tests before i post the source,k since i´m focusing in accuracy and then speed.
I´ll upload it as soon as possible. Hold on :)