News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

How to optimize binary search ?

Started by morgot, June 25, 2022, 10:47:49 PM

Previous topic - Next topic

morgot

There is a binary file in which you need to find an ASCII string. Then, when the string is found, you need to go back a certain number of bytes until a double zero byte is found. Then, starting from this null byte and up to the desired string, copy the data. I did the search, but it works slowly. How can it all be optimized? And how is it better to count the number of characters in the opposite direction?

And is it worth rewriting it to 64 bits, will it be faster?

StringPos proc uses esi edi ebx szString, szSubStr,dwLen,subLen : DWORD
mov esi, szString ;string
mov edi, szSubStr ;substring
mov edx, subLen ;substring length
mov ecx, dwLen ;string length
xor eax, eax ;position

@@:   
repz cmpsb ;compare
test ecx,ecx ;есх = 0 ?
jz @f ;exit
inc eax ;inc counter (position)
jmp @B ;compare again
@@:
sub eax,edx ;move pointer to the begin of string
inc eax ;without last char
ret
StringPos endp

;usage
str1 db "some fake d1ata a115huja",0,"flood data",0
str1_len equ ($-offset str1)-1

.code
start:
mov ebx,str1_len

invoke StringPos,offset str1,chr$("data"),ebx,4
lea ebx,str1
add ebx,eax
invoke OutputDebugStringA,ebx
Sorry for the bad English

hutch--

The first thing I would try is to scan the text from the front and each time you run into a double zero pair, you note its offset and keep searching for the text you want to find. The idea here is that you only search in one direction so you don't have to scan backwards to find the double zero.

The order that you search in will effect the result, from the start, simply search for your text but if you find a double zero, store its offset then keep searching for your text, you may go through this sequence a number of times but when you find the text, you already know where the previous double zero0 is.

jj2007

Quote from: morgot on June 25, 2022, 10:47:49 PM
There is a binary file in which you need to find an ASCII string. Then, when the string is found, you need to go back

How big is the binary file, and how many bytes (roughly) you need to go back? I am asking because on my trusty old i5, finding a binary string at position 39,944,179 in a 40MB binary file takes about 5 milliseconds... just trying to understand what your specific problem is :cool:

morgot

Quote from: jj2007 on June 26, 2022, 12:03:05 AM
How big is the binary file, and how many bytes (roughly) you need to go back?
File is 40-100mb. I need maximum 100bytes go back.

Quote from: hutch-- on June 25, 2022, 11:23:33 PM
The first thing I would try is to scan the text from the front and each time you run into a double zero pair, you note its offset and keep searching for the text you want to find. The idea here is that you only search in one direction so you don't have to scan backwards to find the double zero.
Unfortunately, this is a binary file, and double zero is more common there than the string I need. Those. the structure is like this:

[data...different data...double zero - the desired data (unknown size and format) - the string I need (signature)[. Therefore, I thought to look from it, because the string is unique.
Sorry for the bad English

jj2007

I'm afraid this one is using MasmBasic's Instr() - I am too tired to cook up a Masm32 SDK example:

include \masm32\MasmBasic\MasmBasic.inc
  Init
  GetFiles Thunderbird Setup*.exe ; find a recent binary file
  Let edi=FileRead$(Files$(0)) ; check C:\Users\MyName\Downloads\
  PrintCpu 0
  Print "The file ", Files$(0), Str$(" is %i bytes long\n", LastFileSize)
  push 9
  .Repeat
NanoTimer()
mov edx, LastFileSize ; required by FAST mode
void Instr_(FAST, edi, "DigiCert SHA2 Assured ID Timestamping", 64)
xchg eax, ecx
PrintLine NanoTimer$(), " for finding ", Left$(ecx, 13)
dec stack
  .Until Sign?
  pop edx
  Inkey "hit any key"
EndOfCode


Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
The file Thunderbird Setup 91.2.1.exe is 56437824 bytes long
8312 µs for finding DigiCert SHA2
7635 µs for finding DigiCert SHA2
8003 µs for finding DigiCert SHA2
7815 µs for finding DigiCert SHA2
7797 µs for finding DigiCert SHA2
7679 µs for finding DigiCert SHA2
8278 µs for finding DigiCert SHA2
7703 µs for finding DigiCert SHA2
7791 µs for finding DigiCert SHA2
7592 µs for finding DigiCert SHA2


Instr() returns in edx the relative position, in eax the start of the found string - therefore the xchg eax, ecx.

Going 100 bytes back will cost some nanoseconds extra - you won't notice the difference.

morgot

jj2007, thanks.
in Masm32 there is InString proc, but it used strlen. I fixed it, will be test...

how you test speed ?
Sorry for the bad English


morgot

hutch--,
procedure InString from masm32 if very good, but there is no such procedure in masm64. I tried to rewrite it, but don't works, return -2.

I change register from EC.. to RC.., and variables from DWORD to QWORD, but it crashes.

upd. It works, if I don't use local variables. If use local variables, it rewrite it, and program crashes. I use C compiler, may be need to make prolog manually??

InString1 proc strLen1:QWORD,lpSource:QWORD,lpPattern:QWORD
    ;LOCAL r10:QWORD R10
    ;LOCAL pLen:QWORD r11
;local startpos:qword r12



Does not work! But if I use register instead of local variables - all is good.
Sorry for the bad English

fearless

https://github.com/mrfearless/libraries/blob/master/Masm64/instring.asm

hutch--

Hi morgot, try this, it part of the 64 bit library.


FindStr proc startpos:QWORD,lpSource:QWORD,lpPattern:QWORD

  ; ------------------------------------------------------------------
  ; FindString searches for a substring in a larger string and if it is
  ; found, it returns its position in rax.
  ;
  ; It uses a one (1) based character index (1st character is 1,
  ; 2nd is 2 etc...) for both the "StartPos" parameter and the returned
  ; character position.
  ;
  ; Return Values.
  ; If the function succeeds, it returns the 1 based index of the start
  ; of the substring.
  ;  0 = no match found
  ; -1 = substring same length or longer than main string
  ; -2 = "StartPos" parameter out of range (less than 1 or longer than
  ; main string)
  ; ------------------------------------------------------------------