News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests

Main Menu

Boyer-Moore is no substitute for InString, CRT strstr & friends

Started by jj2007, May 10, 2022, 06:38:55 PM

Previous topic - Next topic

jj2007

Sometimes members ask if they can speed up their text search by using the fast Boyer-Moore and Boyer-Moore-Horspool algos provided by the Masm32 SDK (and other sources, e.g. by mineiro).

The question regularly pops up in other forums, too, e.g. here:
QuoteI was trying to improve the performance of searching files for substrings. I was using strstr() and thought it may be slower than Tuned Boyer-Moore (BM) string search algorithm.  To really test stuff, I wrote a program to compare the strstr and BM searching a 50 line string for some thing thats not in there.  I was surprised to see strstr() is faster than the Boyer-Moore almost two to three times.

Fact is that the BM algos are very good at one specific task: searching huge binary files/buffers for long recurring sequences. That is important e.g. for zipping files.

The timings below show how the BM algos behave on a medium-sized buffer (i.e. the contents of Windows.inc, 977412 bytes) when searching for an initially long, then smaller substring (open C:\Masm32\include\Windows.inc and check line 26900). As you can see, they are faster for long strings but slow on short "needles". You will also see some glitches, indicating that these algos are somewhat delicate to use, as discussed here, here and here.

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Finding a string 500 times in a 954kB 'haystack' from\Masm32\include\windows.inc

... using a 83 bytes substring [Duplicate include file windows.inc<crlf>echo --------

570 ms  for CRT strstr          result=977321
120 ms  for BMHBinsearch        result=977321
53 ms   for BMBinSearch         result=977321
50 ms   for Boyer-M Mineiro 1   result=977321
41 ms   for MB Instr_()         result=977321

... using a 71 bytes substring [Duplicate include file windows.inc<cr
569 ms  for CRT strstr          result=977321
147 ms  for BMHBinsearch        result=977321
60 ms   for BMBinSearch         result=977321
58 ms   for Boyer-M Mineiro 1   result=0
42 ms   for MB Instr_()         result=977321

... using a 60 bytes substring [Duplicate include file windows.inc<cr
569 ms  for CRT strstr          result=977321
146 ms  for BMHBinsearch        result=977321
78 ms   for BMBinSearch         result=977321
70 ms   for Boyer-M Mineiro 1   result=0
42 ms   for MB Instr_()         result=977321

... using a 51 bytes substring [Duplicate include file windows.inc<cr
569 ms  for CRT strstr          result=977321
149 ms  for BMHBinsearch        result=977321
97 ms   for BMBinSearch         result=977321
87 ms   for Boyer-M Mineiro 1   result=977321
40 ms   for MB Instr_()         result=977321

... using a 43 bytes substring [Duplicate include file windows.inc<cr
557 ms  for CRT strstr          result=977321
166 ms  for BMHBinsearch        result=977321
128 ms  for BMBinSearch         result=977321
125 ms  for Boyer-M Mineiro 1   result=977321
42 ms   for MB Instr_()         result=977321

... using a 37 bytes substring [Duplicate include file windows.inc<cr
556 ms  for CRT strstr          result=977321
232 ms  for BMHBinsearch        result=977321
102 ms  for BMBinSearch         result=977321
97 ms   for Boyer-M Mineiro 1   result=977321
42 ms   for MB Instr_()         result=977321

... using a 31 bytes substring [Duplicate include file windows.]
562 ms  for CRT strstr          result=977321
273 ms  for BMHBinsearch        result=977321
126 ms  for BMBinSearch         result=977321
122 ms  for Boyer-M Mineiro 1   result=977321
41 ms   for MB Instr_()         result=977321

... using a 26 bytes substring [Duplicate include file win]
559 ms  for CRT strstr          result=977321
322 ms  for BMHBinsearch        result=977321
224 ms  for BMBinSearch         result=977321
195 ms  for Boyer-M Mineiro 1   result=0
42 ms   for MB Instr_()         result=977321

... using a 22 bytes substring [Duplicate include file]
563 ms  for CRT strstr          result=977321
325 ms  for BMHBinsearch        result=977321
222 ms  for BMBinSearch         result=977321
210 ms  for Boyer-M Mineiro 1   result=0
42 ms   for MB Instr_()         result=977321

... using a 19 bytes substring [Duplicate include f]
562 ms  for CRT strstr          result=977321
344 ms  for BMHBinsearch        result=977321
476 ms  for BMBinSearch         result=977321
464 ms  for Boyer-M Mineiro 1   result=977321
39 ms   for MB Instr_()         result=977321

... using a 16 bytes substring [Duplicate includ]
543 ms  for CRT strstr          result=977321
381 ms  for BMHBinsearch        result=977321
227 ms  for BMBinSearch         result=977321
142 ms  for Boyer-M Mineiro 1   result=0
40 ms   for MB Instr_()         result=977321

... using a 14 bytes substring [Duplicate incl]
543 ms  for CRT strstr          result=977321
402 ms  for BMHBinsearch        result=977321
279 ms  for BMBinSearch         result=977321
262 ms  for Boyer-M Mineiro 1   result=977321
39 ms   for MB Instr_()         result=977321

... using a 12 bytes substring [Duplicate in]
543 ms  for CRT strstr          result=977321
439 ms  for BMHBinsearch        result=977321
404 ms  for BMBinSearch         result=977321
192 ms  for Boyer-M Mineiro 1   result=977321
39 ms   for MB Instr_()         result=977321

... using a 10 bytes substring [Duplicate ]
542 ms  for CRT strstr          result=977321
286 ms  for BMHBinsearch        result=977321
152 ms  for BMBinSearch         result=0
322 ms  for Boyer-M Mineiro 1   result=977321
39 ms   for MB Instr_()         result=977321

... using a 9 bytes substring [Duplicate]
544 ms  for CRT strstr          result=977321
162 ms  for BMHBinsearch        result=977321
122 ms  for BMBinSearch         result=977321
264 ms  for Boyer-M Mineiro 1   result=977321
39 ms   for MB Instr_()         result=977321

... using a 8 bytes substring [Duplicat]
544 ms  for CRT strstr          result=977321
151 ms  for BMHBinsearch        result=977321
121 ms  for BMBinSearch         result=977321
283 ms  for Boyer-M Mineiro 1   result=0
39 ms   for MB Instr_()         result=977321

... using a 7 bytes substring [Duplica]
544 ms  for CRT strstr          result=977321
167 ms  for BMHBinsearch        result=977321
135 ms  for BMBinSearch         result=977321
329 ms  for Boyer-M Mineiro 1   result=977321
39 ms   for MB Instr_()         result=977321

... using a 6 bytes substring [Duplic]
543 ms  for CRT strstr          result=977321
169 ms  for BMHBinsearch        result=977321
148 ms  for BMBinSearch         result=0
381 ms  for Boyer-M Mineiro 1   result=977321
39 ms   for MB Instr_()         result=977321


The attached source in *.asc format opens in Windows WordPad or RichMasm. On top of the source is a switch to activate the Masm32 SDK InString algo.

mineiro

Hello sir jj2007;
Thanks for testing, I repair some faults in that code.

;version 0.02
;ML.EXE /c /coff /Cp main.asm
;LINK.EXE /SUBSYSTEM:CONSOLE main.obj
;main

include \masm32\include\masm32rt.inc
.686
Boyer_Moore PROTO :dword,:dword,:dword,:dword,:dword,:dword
printf proto C :dword, :vararg          ; msvcrt

.data
format_string db "pos: %d",13,10,0
occurrences db "found %d ocurrence(s)",13,10,0


file db "\masm32\include\windows.inc",0
string db "Duplicate"
;string db "Duplicat"

align 4
Text_test db 'xx includix'
String_test db 'did'

.data?
bad_chars dd 256 dup (?)

.code
start:
;-----------------------------------
    call main ;will search for string "Duplicate" in file \masm32\include\windows.inc
;-----------------------------------
;if you want to check specific cases, comment call above and uncoment lines below to Text_test and String_test cases
;    invoke Boyer_Moore,0,addr Text_test,sizeof Text_test,addr String_test,sizeof String_test,0
;    push eax
; invoke printf,addr format_string,eax
; pop eax   
;-----------------------------------
    invoke ExitProcess,eax


main proc uses esi edi ebx

LOCAL plc:DWORD ;cmd line pointer
LOCAL saida:DWORD ;output handle
LOCAL manipulador_arquivo:dword ;file handle
LOCAL tamanho:DWORD ;source file size
LOCAL mapeado:DWORD ;ptr source file
LOCAL mem_aloc:DWORD ;ptr mem alloc
LOCAL nada:dword
LOCAL arq_mape:DWORD ;mapped file handle
LOCAL pos:DWORD ;initial position
LOCAL return_value:DWORD ;return result in eax register
LOCAL first_r:DWORD

invoke GetStdHandle,STD_OUTPUT_HANDLE
mov saida,eax

invoke GetCommandLine
mov plc,eax

invoke CreateFile,addr file,GENERIC_READ,0,0,OPEN_EXISTING,FILE_FLAG_SEQUENTIAL_SCAN,0 ;open source file
mov manipulador_arquivo,eax ;handle

invoke GetFileSize, manipulador_arquivo,NULL ;sizeof source file
mov tamanho,eax
invoke CreateFileMapping,manipulador_arquivo,NULL,PAGE_READONLY,0,0,0 ;mapped file handle
mov arq_mape,eax

invoke MapViewOfFile,eax,FILE_MAP_READ,0,0,0 ;map it, file contents pointer
mov mapeado,eax
invoke CloseHandle,manipulador_arquivo ;don't need this anymore


mov pos,0
mov first_r,0 ;inserted one more parameter to Boyer_Moore procedure
;if first_run is zero, so table creation is done
;if first_run is not zero, so table creation is skipped, gaining time execution
;so, table bad_chars was moved to data section to be globally accessed
@@:
invoke Boyer_Moore,pos,mapeado,tamanho,addr string,sizeof string,first_r
mov return_value,eax
.if return_value == -1
jmp done
.elseif return_value == -2
jmp done
.else
invoke printf,addr format_string,return_value
inc return_value
m2m pos,return_value
inc first_r ;this is a occurrences count, and will trigger to not recreate lookup table
jmp @B
.endif

done:
invoke UnmapViewOfFile,mapeado

invoke printf,addr occurrences,first_r
mov eax,return_value
    ret
main endp


Boyer_Moore proc uses ebx esi edi startpos:dword,lpSource:dword,srcLngth:dword,lpSubStr:dword,subLngth:dword,first_run:dword

cmp first_run,0
jnz @F
lea edi,bad_chars ;create a lookup table filled with substring size
mov ecx,256
mov eax,subLngth
rep stosd

lea edi,bad_chars
mov esi,lpSubStr
mov ecx,subLngth
add esi,subLngth
dec esi ;point to last char of substring, zero index based

;-------------------------
;eg: string = "batata", a=1, t=1, b=5, any other chars will be = 6 (sizeof string)
;eg: string = "done", e=1, n=1, o=2, d=3, any other chars will be = 4
;-------------------------
movzx eax,byte ptr [esi] ;read last char
mov dword ptr [edi+eax*4],1 ;its size will be one displacement, insert in bad_chars table
mov ebx,1 ;second char will be one too, third char will be 2, fourth char will be 3, ...
dec esi
dec ecx
mov edx,subLngth
.while ecx != 0 ;while exist chars in substring
movzx eax,byte ptr [esi] ;read that char


.if dword ptr [edi+eax*4] == edx ;actual char have a sizeof subLngth?
mov dword ptr [edi+eax*4],ebx ;yes, so overwrite this char displacement substring in lookup table
inc ebx ;increase displacement ;<--fault if moved to below
.endif
dec esi
dec ecx
.endw

;------------------------------
@@:
mov eax,-2 ;assume error in parameters data
mov esi,lpSubStr
mov edi,lpSource
mov ebx,srcLngth
mov ecx,subLngth

test ebx,ebx
jz done
test ecx,ecx
jz done
cmp ecx,ebx
ja done

add ebx,edi ;ebx=lpSource+srcLngth
add edi,startpos ;edi=lpSource+startpos

mov edx,subLngth
align 4
try_again:
mov eax,-1
cmp ebx,edi
jbe done
mov ecx,edx
@@:
movzx eax,byte ptr [edi+ecx-1] ;read one char from source
cmp byte ptr [esi+ecx-1],al
jne @F
dec ecx
jnz @B
sub edi,lpSource
mov eax,edi
jmp done
@@:

mov eax,dword ptr [bad_chars+eax*4]
cmp eax,edx ;test eax,edx ;<-first fault
cmove eax,ecx
add edi,eax
jmp try_again

align 4
done:
ret

Boyer_Moore endp

end start


Hmm, time results don't look good. Well, ... .
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

Thanks, mineiro - see new version here.

Don't worry about the timings; MB has its uses, and so far your versions are the only ones that work without errors :thumbsup:

mineiro

I was thinking why grep program uses BM.
And if test is done by search for all occurrences of a word in a text? Like 10 occurrences. So that code can have a lot of gain because lookup table is need be build only at first search.
I can do some tests at night, and other variants too need travel now.
I'd rather be this ambulant metamorphosis than to have that old opinion about everything

jj2007

Quote from: mineiro on May 10, 2022, 09:22:26 PMcode can have a lot of gain because lookup table is need be build only at first search.

  NanoTimer()
  push iterations-2
  invoke Boyer_MooreM1, 0, esi, hLen, sub$, sLen, 0
  .Repeat
invoke Boyer_MooreM1, 0, esi, hLen, sub$, sLen, 1
dec stack
  .Until Sign?
  pop edx
  lea ecx, [eax+1] ; unlike other InString algos, the BM variants return a 0-based index - corrected here
  PrintLine NanoTimer$(), Str$("\tfor Boyer-M Mineiro 1 \tresult=%i", ecx)