Text only | Text with Images

The MASM Forum

Projects => MasmBasic & the RichMasm IDE => Topic started by: guga on August 04, 2023, 02:30:50 AM

Title: Instring
Post by: guga on August 04, 2023, 02:30:50 AM

Hi JJ

how do i use Instring ?

I´m having troubles trying to use in order to see if i can port it. Do you have the plain masm code for it ?

I´m trying to make a simple string search inside a text like this:

[INSTR_CASE 0] ; Is this the proper equate/flag ?
[INSTR_NO_CASE 2] ; Is this the proper equate/flag ?
[INSTR_INTELLISENSE 64] ; Is this the proper equate/flag ?

[BigString: B$ "Hello, my name is guga and i want to buy a car", 0]
[SmallString: B$ "guga", 0]

call InStringSSE2 BigString, SmallString, INSTR_NO_CASE

The goal is to search for the string "guga" (case sensitive,. case insensitive and intellisense) inside the main bigger string "Hello, my name is guga and i want to buy a car" and return the position (or address which would be better) where the string was found

What is the proper syntax for it ? I tested with masmbasic, but without any suces

Title: Re: Instring
Post by: jj2007 on August 04, 2023, 05:21:50 AM

Instr_(...) (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1153):

Code Select

                mov pos, Instr_(1, L$(n), "equ", 1+4)    ; 1=start pos, 1=case-insensitive + 4=full word
Rem        - returns relative pos in edx, absolute in eax
                - if no match is found, zero is returned in edx, and eax points to the start of the 'haystack'; same if 'needle' is empty
                - six syntax variants allowed:
                    A: has startpos:        Instr_(1, "Test", "Te", 2)                            ; 4 args
                    B: no startpos:            Instr_("Test", "Te", 2)                                ; 3 args, last one immediate = mode
                    C: has startpos:        Instr_(1, "Test", "Te")                                ; 3 args, last one not immediate
                    D: no startpos:            Instr_("Test", "Te")                        ; 2 args
)                                                                            ; 2 args
                    E: test several patterns:        InstrOr("Test", "Te" or "st", 1)                ; 3 args (startpos is 1)
                    F: extra fast mode:    Instr_(FAST, My$(ecx), "Hello", 0)            ; 4 args, no startpos, modes 0+2 only
                - case & mode (bitwise flag):
                    0=case-sensitive, +1=insensitive, +2=intellisense (Name=name, i.e. case of first char ignored),
                    +4=full word search, +8=include start of line in text block search, +4+16=not full word

InString is a Masm32 SDK algo, also used with the Masm32 find$(big$, small$) macro.
InStr_() is a MasmBasic macro.

Full test:

Code Select

include \masm32\MasmBasic\MasmBasic.inc
  SetGlobals BigString, SmallString
  Init
  Let BigString="Hello, my name is guga and i want to buy a car"
  Let SmallString="guga"
  .if Instr_(BigString, SmallString)
	xchg eax, ecx
	Inkey "[", Left$(ecx, Len(SmallString)), "]"
  .endif
EndOfCode

Shorter:

Code Select

include \masm32\MasmBasic\MasmBasic.inc
  SetGlobals Big$="Hello, my name is guga and i want to buy a car", Small$="guga"
  Init
  .if Instr_(Big$, Small$)
	xchg eax, ecx
	Inkey "[", Left$(ecx, Len(Small$)), "]"
  .endif
EndOfCode

The xchg eax, ecx is needed because Len() destroys resp. returns eax.

Title: Re: Instring
Post by: mineiro on August 04, 2023, 08:21:15 AM

Hello sir guga;
Search by Boyer Moore into this board. Need some optimizations link bellow:
https://masm32.com/board/index.php?msg=109922
Sir jj2007 have done some tests, you can check results from that post to front.

Title: Re: Instring
Post by: jj2007 on August 04, 2023, 01:28:22 PM

Quote from: mineiro on August 04, 2023, 08:21:15 AMHello sir guga;
Search by Boyer Moore into this board. Need some optimizations link bellow:
https://masm32.com/board/index.php?msg=109922
Sir jj2007 have done some tests, you can check results from that post to front.

Timings are here (https://masm32.com/board/index.php?topic=10040.msg109942#msg109942). Example:

Code Select

... using a 6 bytes substring [Doplic]
599 ms  for CRT strstr          result=105560497
563 ms  for M32 InString        result=105560497
181 ms  for BMHBinsearch        result=105560497
158 ms  for BMBinSearch         result=0
420 ms  for Boyer-M Mineiro 1   result=105560497
417 ms  for Boyer-M Mineiro 2   result=105560497
61 ms   for MB Instr_()         result=105560497

Title: Re: Instring
Post by: lingo on August 05, 2023, 04:44:49 PM

Code Select


.const
EQUAL_ANY           equ 0000b
RANGES              equ 0100b
EQUAL_EACH          equ 1000b
EQUAL_ORDERED       equ 1100b
NEGATIVE_POLARITY   equ 010000b
BYTE_MASK           equ 1000000b

.code
align 16     
InstrLingo      proc        haystack:DWORD,needle:DWORD
                mov         ecx,  haystack                 ; ecx = haystack, edx = needle
                push        esi
                mov         edx,  needle
                push        edi
                mov         eax,  ecx 
                movdqu      xmm2, oword ptr[edx]           ; load 1st 16 bytes of needle
                pxor        xmm3, xmm3
@Loop:                                                     ; find the first possible match of 16-byte fragment in haystack        
                pcmpistri   xmm2, oword ptr[eax], EQUAL_ORDERED
                lea         eax, [eax+16]
                ja          @Loop
                lea         eax, [eax+ecx-16]               ; save the possible match start
                mov         edi, edx
                mov         esi, eax
                jnc         @NotFound
                sub         edi, eax 
@@:                                                         ; compare the strings
                movdqu      xmm1, oword ptr [esi + edi]
                pcmpistrm   xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
                movdqu      xmm4, oword ptr[esi]
                pand        xmm4, xmm0
                pcmpistri   xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
                lea         esi, [esi+16]
                ja          @b
                jnc         @Found
                add         eax, 1                          ; continue searching from the next byte
                jmp         @Loop
@NotFound:
                xor         eax, eax
@Found:     
                pop         edi
                pop         esi
                ret
InstrLingo      endp

:skrewy: :biggrin:

Title: Re: Instring
Post by: jj2007 on August 05, 2023, 06:51:58 PM

Hi Lingo,

Not surprisingly (https://masm32.com/board/index.php?topic=9237.msg117041#msg117041), your code performs exactly like MasmBasic Instr_(), but how come it returns a different position? Did you forget to subtract the start position?

Code Select

Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Finding a string 5 times in a 100MB 'haystack' from\Masm32\include\Windows.inc
... using a 8 bytes substring [Doplicat]
603 ms  for CRT strstr          result=105560497
586 ms  for M32 InString        result=105560497
162 ms  for BMHBinsearch        result=105560497
134 ms  for BMBinSearch        result=0
323 ms  for Boyer-M Mineiro 1  result=0
324 ms  for Boyer-M Mineiro 2  result=0
60 ms  for Lingo              result=143964624
63 ms  for MB Instr_()        result=105560497

... using a 7 bytes substring [Doplica]
624 ms  for CRT strstr          result=105560497
588 ms  for M32 InString        result=105560497
172 ms  for BMHBinsearch        result=105560497
146 ms  for BMBinSearch        result=105560497
374 ms  for Boyer-M Mineiro 1  result=105560497
369 ms  for Boyer-M Mineiro 2  result=105560497
62 ms  for Lingo              result=143964624
63 ms  for MB Instr_()        result=105560497

... using a 6 bytes substring [Doplic]
610 ms  for CRT strstr          result=105560497
581 ms  for M32 InString        result=105560497
186 ms  for BMHBinsearch        result=105560497
162 ms  for BMBinSearch        result=0
423 ms  for Boyer-M Mineiro 1  result=105560497
427 ms  for Boyer-M Mineiro 2  result=105560497
63 ms  for Lingo              result=143964624
62 ms  for MB Instr_()        result=105560497

Title: Re: Instring
Post by: Caché GB on August 05, 2023, 10:10:20 PM

Hi lingo

Thanks for this algo, it works perfectly.

Code Select


         invoke  InstrLingo, addr longString, addr shortString

            lea  ecx, longString        ;  eax returns Address of shortString
            sub  eax, ecx
            mov  BytePosition, eax      ;  now eax = Position of shortString

Awrsome. :thumbsup:

Title: Re: Instring
Post by: jj2007 on August 05, 2023, 10:29:45 PM

Quote from: Caché GB on August 05, 2023, 10:10:20 PMHi lingo

Thanks for this algo, it works perfectly.

It was added to MasmBasic Instr_() (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1153) over a year ago. Credits should go to Peter Kankowski (https://www.strchr.com/strcmp_and_strlen_using_sse_4.2) (13 years ago):

Code Select

strstr_sse42:
  ; ecx = haystack, edx = needle

  push esi
  push edi
  MovDqU xmm2, dqword[edx] ; load the first 16 bytes of neddle
  Pxor xmm3, xmm3
  lea eax, [ecx - 16]

  ; find the first possible match of 16-byte fragment in haystack
STRSTR_MAIN_LOOP:
    add eax, 16
    PcmpIstrI xmm2, dqword[eax], EQUAL_ORDERED
    ja STRSTR_MAIN_LOOP

  jnc STRSTR_NOT_FOUND

  add eax, ecx ; save the possible match start
  mov edi, edx
  mov esi, eax
  sub edi, esi
  sub esi, 16

  ; compare the strings
@@:
    add esi, 16
    MovDqU    xmm1, dqword[esi + edi]
    ; mask out invalid bytes in the haystack
    PcmpIstrM xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK
    MovDqU xmm4, dqword[esi]
    PAnd xmm4, xmm0
    PcmpIstrI xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
    ja @B

  jnc STRSTR_FOUND

  ; continue searching from the next byte
  sub eax, 15
  jmp STRSTR_MAIN_LOOP

STRSTR_NOT_FOUND:
  xor eax, eax

STRSTR_FOUND:
  pop edi
  pop esi
  ret

Compare that to Lingo's code:

Code Select

InstrLingo      proc        haystack:DWORD,needle:DWORD
                mov         ecx,  haystack                 ; ecx = haystack, edx = needle
                push        esi
                mov         edx,  needle
                push        edi
                mov         eax,  ecx
                movdqu      xmm2, oword ptr[edx]           ; load 1st 16 bytes of needle
                pxor        xmm3, xmm3
@Loop:                                                     ; find the first possible match of 16-byte fragment in haystack       
                pcmpistri   xmm2, oword ptr[eax], EQUAL_ORDERED
                lea         eax, [eax+16]
                ja          @Loop
                lea         eax, [eax+ecx-16]               ; save the possible match start
                mov         edi, edx
                mov         esi, eax
                jnc         @NotFound
                sub         edi, eax
@@:                                                         ; compare the strings
                movdqu      xmm1, oword ptr [esi + edi]
                pcmpistrm   xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
                movdqu      xmm4, oword ptr[esi]
                pand        xmm4, xmm0
                pcmpistri   xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
                lea         esi, [esi+16]
                ja          @b
                jnc         @Found
                add         eax, 1                           ; continue searching from the next byte
                jmp         @Loop
@NotFound:
                xor         eax, eax
@Found:     
                pop         edi
                pop         esi
                ret
InstrLingo      endp

MasmBasic's Instr_() has minor optimisations for speed and size but is basically the same.

Title: Re: Instring
Post by: lingo on August 05, 2023, 11:24:15 PM

Thank you JJ,

My algo works without error.
Tested it with similar texts from the attachment. :biggrin:

Title: Re: Instring
Post by: Biterider on August 05, 2023, 11:33:37 PM

Hi
The code is fast but potentially unsafe.
If the fetched memory blocks cross a page boundary to an uncommited page, an exception is raised.
The only way to use this code safely is to align the strings to 16.

Biterider

Title: Re: Instring
Post by: jj2007 on August 05, 2023, 11:48:52 PM

Quote from: lingo on August 05, 2023, 11:24:15 PMMy algo works without error.

If you subtract the source, as noted by myself and Caché GB, Kankowski's algo works fine indeed.

624 ms for CRT strstr result=105560497
588 ms for M32 InString result=105560497
172 ms for BMHBinsearch result=105560497
146 ms for BMBinSearch result=105560497
374 ms for Boyer-M Mineiro 1 result=105560497
369 ms for Boyer-M Mineiro 2 result=105560497
62 ms for Lingo result=143964624
63 ms for MB Instr_() result=105560497

Quote from: Biterider on August 05, 2023, 11:33:37 PMThe code is fast but potentially unsafe.
If the fetched memory blocks cross a page boundary to an uncommited page, an exception is raised.
The only way to use this code safely is to align the strings to 16.

Biterider

I had that suspicion, too. Maybe we need a test case...

Title: Re: Instring
Post by: Biterider on August 06, 2023, 06:29:01 PM

Quote from: jj2007 on August 05, 2023, 11:48:52 PMI had that suspicion, too. Maybe we need a test case...

Searching a bit I found some discussions about this issue, one of them is
https://www.purebasic.fr/english/viewtopic.php?p=605105 (https://www.purebasic.fr/english/viewtopic.php?p=605105)

One possible solution is to align the pointer for the retrieved characters and invalidate the first characters that are not part of the string. Unfortunately, masking is not possible with this instruction, but the invalid characters can be replaced with non-zero characters (byte/word), e.g. "01", that are not in the string to be searched.
This means a little more execution time, but then the function could be used universally.

Biterider

Title: Re: Instring
Post by: Biterider on August 06, 2023, 07:14:52 PM

Hi
When trying to implement something to solve the pafe boundary problem, I quickly discovered that the "needle" has exactly the same problem, which increases the complexity considerably.

Atm I can't see a simple solution for both issues that are efficient enough. :nie:

I wonder sometimes what the developers of such instructions have in mind when it comes to implementing the code... :eusa_pray:

Biterider

Title: Re: Instring
Post by: jj2007 on August 06, 2023, 08:31:45 PM

Quote from: Biterider on August 06, 2023, 07:14:52 PMAtm I can't see a simple solution for both issues that are efficient enough. :nie:

Test case:

Code Select

  invoke VirtualAlloc, 0, 4096, MEM_COMMIT, PAGE_READWRITE
  xchg eax, edi
  invoke lstrcpyn, edi, esi, 4096	; esi is a pointer e.g. to contents of Windows.inc

Now perform the instr() on edi :cool:

It works fine with SEH, but I agree there is no other efficient solution :sad:

Title: Re: Instring
Post by: Biterider on August 06, 2023, 08:52:20 PM

As it stands now, I strongly recommend not using these implementations for general purposes, as this will lead to unexplained and random crashes in customer code, resulting in a sh*storm.

Biterider

Title: Re: Instring
Post by: jj2007 on August 06, 2023, 09:39:20 PM

Quote from: Biterider on August 06, 2023, 08:52:20 PMAs it stands now, I strongly recommend not using these implementations for general purposes, as this will lead to unexplained and random crashes in customer code, resulting in a sh*storm.

I've used them for over a year in all my code, no crashes. HeapAlloc'ed strings and buffers are not a problem, in contrast to this:

Code Select

  invoke VirtualAlloc, 0, 4096, MEM_COMMIT, PAGE_READWRITE
  xchg eax, edi
  Let esi=FileRead$("\Masm32\include\Windows.inc")
  invoke lstrcpyn, edi, esi, 4096	; esi is a pointer e.g. to contents of Windows.inc
  Let sub2$="ACMFILTERCHOOSEHOOKPROC     typ"	; "typ" is ok, "type" fails with an exception
  Print Str$("The pos is %i\n", Instr_(edi, sub2$))

The solution is SEH.

Title: Re: Instring
Post by: Biterider on August 06, 2023, 10:29:37 PM

Hi JJ
It is only a matter of time before this code bites you. :badgrin:

If the SEH handler has to catch this situation, what should it do? Fall back on the regular implementation?
Thinking about that, you have to protect every single call to these new procedures...
A higher level handler is also possible, but the complexity does not justify it.

Another side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.

Still not a good solution for production code...

Biterider

Title: Re: Instring
Post by: jj2007 on August 06, 2023, 11:27:25 PM

Quote from: Biterider on August 06, 2023, 10:29:37 PMIf the SEH handler has to catch this situation, what should it do? Fall back on the regular implementation?

Return zero = not found.

QuoteThinking about that, you have to protect every single call to these new procedures...

Once, at program start.

QuoteAnother side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.

Nope, the timings are exactly the same if a match is found. If no match is found and the SEH is triggered, then you have a design problem.

Code Select

  invoke VirtualAlloc, 0, 4096, MEM_COMMIT, PAGE_READWRITE
  xchg eax, edi
  Let esi=FileRead$("\Masm32\include\Windows.inc")
  invoke lstrcpyn, edi, esi, 4096	; esi is a pointer e.g. to contents of Windows.inc
  Let sub32$="ACMFILTERCHOOSEHOOKPROC     type"	; type fails with an exception, "typ" is ok
  Let sub31$=Left$(sub32$, 31)
  Print Str$("For 31 chars, the pos is %i\n", Instr_(edi, sub31$))
  Print Str$("For 32 chars, the pos is %i\n", Instr_(edi, sub32$))
  Inkey "That was fun with SEH"

Output:

Code Select

For 31 chars, the pos is 4058
Caught at pos:  $esi            def DW
For 32 chars, the pos is 0
That was fun with SEH

Title: Re: Instring
Post by: Biterider on August 07, 2023, 02:36:32 AM

Quote from: jj2007 on August 06, 2023, 11:27:25 PM
QuoteIf the SEH handler has to catch this situation, what should it do? Fall back on the regular implementation?

Return zero = not found.

It is obviously wrong to always return 0 when you have a match, which leads to wrong results.

Quote from: jj2007 on August 06, 2023, 11:27:25 PM
QuoteThinking about that, you have to protect every single call to these new procedures...

Once, at program start.

You are talking about a TL-handler that has to deal with all kinds of exceptions coming from somewhere in your application.
This means that you have to analyse exactly where the exception occurred and what went wrong, and then unwind the whole thing to the offending procedure, which is a leaf procedure.

Quote from: jj2007 on August 06, 2023, 11:27:25 PM
QuoteAnother side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.

Nope, the timings are exactly the same if a match is found. If no match is found and the SEH is triggered, then you have a design problem.

You misinterpreted my comment, I said when an exception is triggered. That comes from the transitions from user-land to ring 0 and vice-versa.

The whole thing is not about SEH, it is about code quality degradation. :nie:
It is expected that when a proc that works perfectly fine is replaced by a better/faster version that it returns exactly the same results.
It makes no sense to replace it with a faster version that may work 99.9% but fails under some conditions the user has no control over. :eusa_naughty:

For sure there are cases where a wrong result can be tolerated, but imagine you use this code in a critical process and this super fast routine fails 1/1000 times crashing a machine/car/rocket.
No one in their right mind would use it as a general purpose routine for a few microseconds that don't matter in the end.

I don't want to hurt anyone's feelings, but wouldn't it be better to find a better and suitable solution? :wink2:

Biterider

Title: Re: Instring
Post by: jj2007 on August 07, 2023, 06:11:13 AM

Hi Biterider, it seems I edited your post and made a mess out of it - apologies.

QuoteIt is obviously wrong to always return 0 when you have a match, which leads to wrong results.

Describe a situation where an exception was triggered but there still was a match.

QuoteYou are talking about a TL-handler that has to deal with all kinds of exceptions coming from somewhere in your application.
This means that you have to analyse exactly where the exception occurred and what went wrong, and then unwind the whole thing to the offending procedure, which is a leaf procedure.

You didn't check the attachment posted above.

Quote
Quote from: jj2007 on August 06, 2023, 11:27:25 PM
QuoteAnother side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.

Nope, the timings are exactly the same if a match is found. If no match is found and the SEH is triggered, then you have a design problem.

You misinterpreted my comment, I said when an exception is triggered. That comes from the transitions from user-land to ring 0 and vice-versa.

Yes, we are wasting a millisecond or so in the extremely unlikely event of an exception.

Title: Re: Instring
Post by: Biterider on August 07, 2023, 06:47:57 AM

Quote from: jj2007 on August 07, 2023, 06:11:13 AMDescribe a situation where an exception was triggered but there still was a match.

Suppose you have an ANSI zero terminated string "abcABC" at the end of a page and the next page is not commited.
Now you start searching for "ABC" at a 16 byte unaligned address. You overshoot the page boundary by 9 bytes raising an exception and you should return a match result.

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

?

a

b

c

A

B

C

0

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

Biterider

Title: Re: Instring
Post by: jj2007 on August 08, 2023, 09:29:31 PM

Well, you have a point there :thumbsup:

It's not so relevant for MasmBasic, because strings and buffers are all HeapAlloc'ed with a little reserve of 24 bytes that helps to avoid this scenario.

See Peter Cordes' very detailed description of pcmpistri & pcmpestri performance at SOF (https://stackoverflow.com/questions/58901232/why-is-sse4-2-cmpstr-slower-than-regular-code)

I have a version that checks the bounds, and it's only a tick slower than the fast version, but I really don't think it's worth the effort. Real life strings and buffers should never end at no man's land.

Btw this logic should be applied to all code. Attached an example that shows the behaviour of the Masm32 SDK len() macro. You might be surprised :cool:

Title: Re: Instring
Post by: guga on September 19, 2024, 03:16:19 AM

Hi, JJ, i´m trying to create a benchmark test on a new BinarySearch/Instring function i did (specifically with SSE2 only), but i couldn´t be able to implement it on your benchmark app to test it. Can, u please input it on your bench algo so we can test the speed of it ?

It is designed to work with any pattern size. So, it can handle strings (or data chunks) from 1 to XXX bytes. Since i focused on speed, i didn´t implemented on the function an error check in case the pattern is smaller than the dataset or check if the size of the data/pattern are zeroed etc. The size of the dataset must be bigger or equal to 16 bytes long

I attached the masm compatible source code (the same syntax i used when testing on your benchmark app).

If you need explanations on usage here is The RosAsm code (With explanations of how the function works):

Main function SIMD_BinaryScan

Code Select


;;
        SIMD_BinaryScan
            This function performs a binary search for a chunk of data (pSearchPattern) within another larger dataset (pDataBuffer).
            It uses SSE2 instructions for optimization and speed, especially with large data sets.
            The function can be used to find data from any size, including strings (case sensitive) on a big text

        Parameters:

            pDataBuffer (in):       Pointer to the start of the larger dataset in which the search is performed.
                                    This is the buffer in which you are searching.
            DataSize (in):          The size (in bytes) of the Data.
            pSearchPattern (in):    Pointer to the start of the Search Pattern (the data chunk to search for within the DataSet).
                                    This is the smaller data chunk you are searching for.
            PatternSize (in):       The size (in bytes) of the SearchPattern.


        Return Value:

            Success: If the Pattern is found, the function returns a pointer to the starting position of the matching chunk in the Dataset.
            Failure: Returns 0 (&NULL) if the Pattern is not found in the Dataset.

        Remarks:
            Main Function Flow:

            1 - Initialize First and Last Byte of the Pattern:
                The first and last bytes of the Pattern are expanded into SSE registers xmm2 and xmm3 for comparison against blocks of the DataSet.

            2 - Loop through Dataset Blocks:
                Each block of 16 bytes from the Dataset is loaded into xmm0 and xmm1 to compare the first and last bytes with the corresponding bytes of the pattern.
                SSE2 instructions (pcmpeqb, pand, pmovmskb) are used to generate masks and compare these blocks efficiently.

            3 - Check for Match:
                If matching bits are found, the position of the first matching byte is retrieved, and a detailed byte-by-byte comparison is done with
                SIMD_ComparePattern to confirm the match.
                If no match is found, the loop continues with the next block of data in the Dataset.

            4 - Handle Remainder:
                If the Dataset's size is not a multiple of 16, any remaining bytes are handled separately at the end of the Dataset.

            5 - Return Result:
                If the pattern is found, the function returns the position of the match; otherwise, it returns 0 (no match).

        Example of usage:
        
        [haystack: B$ "This is a simples string to search within. While trying to find sinp from sinples string", 0]
        [needle: B$ "search wi", 0]

            call 'FastCRT.StrLen' haystack
            mov ebx eax
            call 'FastCRT.StrLen' needle

            call SIMD_BinaryScan haystack, ebx, needle, eax
;;

Proc SIMD_BinaryScan:
    Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
    Local @BitPos, @StringEnd, @MaxPos
    Structure @XmmPreserve 80, @XMMReg0Dis 0, @XMMReg1Dis 16, @XMMReg2Dis 32, @XMMReg3Dis 48, @XMMReg4Dis 64
    Uses esi, ebx, edx, edi

    ; save the contents xmm registers to avoid altering them
    lea eax D@XMMReg0Dis | movdqu X$eax XMM0
    lea eax D@XMMReg1Dis | movdqu X$eax XMM1
    lea eax D@XMMReg2Dis | movdqu X$eax XMM2
    lea eax D@XMMReg3Dis | movdqu X$eax XMM3

    ; 1st get pointer to the string/data to be found
    mov esi D@pSearchPattern
    ; get the 1st char of the
    movzx eax B$esi | SSE2_EXPAND_BYTE2 xmm2 eax ; xmm2 = extended 1st byte of the string xmm2 = Var_First

    mov eax esi | add eax D@PatternSize
    movzx eax B$eax-1 | SSE2_EXPAND_BYTE2 xmm3 eax ; xmm3 = extended last byte of the string xmm3 = Var_Last

    mov esi D@pDataBuffer | mov eax esi | add eax D@DataSize | sub eax 16 | mov D@StringEnd eax
    ..Do
        movups XMM0 X$esi; xmm0 = Var_BlockFirst.
        mov eax D@PatternSize
        movups XMM1 X$esi+eax-1; xmm1 = Var_BlockLast.

        ; we need to create a mask formed by the comparition of the last bytes.
        ; Var_EqFirst = comparition between Var_First and Var_BlockFirst. The result is stored in xmm0
        pcmpeqb xmm0 xmm2   ; compare Var_First with previous Var_BlockFirst = Var_EqFirst

        ; Var_EqLast = comparition between Var_Last and Var_BlockLast. The result is stored in xmm1
        pcmpeqb xmm1 xmm3   ; compare Var_Last with previous Var_BlockLast = Var_EqLast

        pand xmm1 xmm0      ; Now we simply need to and those masks (Var_EqFirst and Var_EqLast). We store in xmm1 to preserve xmm0 to reuse below
        pmovmskb ebx XMM1   ; And finally create that mask and copy it onto eax. Maximum value = 0FFFF
        .Test_If_Not_Zero ebx
            Do
                ; get first bit_set
                GET_FIRST_BIT_SET ebx | mov D@BitPos eax
                lea edi D$esi+eax
                call SIMD_ComparePattern edi, D@pSearchPattern, D@PatternSize
                cmp eax 0 | jnz L2>>
                
                ;bsr ax bx | btr bx ax | jc L1< ; in practise, the carry flag will be settled only when ebx (bx) is not zero. So, the loop will works whenever ebx <> 0 
                ; clear leftmost bit_set
                CLEAR_LEFTMOST_SET16 bx
            Repeat_Until_Zero
            ; now we store the last known bitpos where the string was not matched,
            ; and add it to the string size to get the next position to search
            mov eax D@BitPos | add eax D@PatternSize
        .Test_Else
            pmovmskb eax xmm0 ; And finally create that mask and copy it onto eax. Maximum value = 0FFFF
            Test_If_Not_Zero eax
                ; get last bit_set
                GET_LAST_BIT_SET eax
                inc eax ; The last pair of bits didn´ matched. Therefore, we need to increase only by 1 to do to the next position
            Test_Else
                mov eax 16
            Test_End
        .Test_End
        add esi eax
    ..Loop_Until esi > D@StringEnd

    ; Now handle remainder
    .If D@PatternSize <= 16
        mov esi D@StringEnd
        movups XMM0 X$esi; xmm0 = Var_BlockFirst.
        ; we need to create a mask formed by the comparition of the last bytes.
        ; Var_EqFirst = comparition between Var_First and Var_BlockFirst. The result is stored in xmm0
        pcmpeqb xmm0 xmm2; compare Var_First with previous Var_BlockFirst = Var_EqFirst
        pmovmskb ebx XMM0 ; Create that mask and copy it onto eax. Maximum value = 0FFFF
        .Test_If_Not_Zero ebx
            mov eax 16 | sub eax D@PatternSize | mov D@MaxPos eax
            Do
                ; get first bit_set
                GET_FIRST_BIT_SET ebx; | mov D@BitPos eax
                cmp eax D@MaxPos | jnbe L1> ; exit if data is not below or equal (So, jump if bigger than ...)
                lea edi D$esi+eax
                call SIMD_ComparePattern edi, D@pSearchPattern, D@PatternSize
                cmp eax 0 | jnz L2>
                 ; clear leftmost bit_set
                CLEAR_LEFTMOST_SET16 bx
            Repeat_Until_Zero
        .Test_End
    .End_If
L1:

    xor eax eax

L2:

    lea esi D@XMMReg0Dis | movdqu XMM0 X$esi
    lea edi D@XMMReg1Dis | movdqu XMM1 X$edi
    lea esi D@XMMReg2Dis | movdqu XMM2 X$esi
    lea edi D@XMMReg3Dis | movdqu XMM3 X$edi

EndP

Internal byte compare function

Code Select


;;
    SIMD_ComparePattern

    This function, performs a detailed byte-by-byte comparison of two buffers. It is called once a potential
    match is found by the SSE2 block comparison.

    Parameters:

        pBuffer1 (in):  Pointer to the block in the haystack (loaded in esi).
        pBuffer2 (in):  Pointer to the needle (loaded in edi).
        length (in):    Length (in bytes) of the buffers to be compared.

    Return Value:

        Success: Returns the pointer to pBuffer1 if the contents match exactly.
        Failure: Returns 0 if the contents do not match.

    Remarks:
        This function is an adaptation of memcmp_SSE2, but to be used only inside SIMD_BinaryScan.
        It is also developed for speed, for this reason it don't preserve the registers (except esi)
        that are being used in SIMD_BinaryScan.

;;

Proc SIMD_ComparePattern:
    Arguments @pBuffer1, @pBuffer2, @Lenght
    Local @Reminder
    Uses esi; No need to preserve edi and others registers too since this function is used only inside SIMD_BinaryScan.

    ; Step1 - Check if the lenght is zeroed. If lenght is 0, jump over the whole function
    mov edx D@Lenght
    mov esi D@pBuffer1
    mov edi D@pBuffer2
    and edx 0-16 | mov eax edx | xor edx D@Lenght ; When 0 means lenght is divisible by 16 and we have no remainders, jmp over to the main function

    ; Step2
    .Test_If_Not_Zero eax ; edx = 0 ? no remainders exit
        mov D@Reminder edx
        ; We are working with multiples of 16 bytes, therefore, we can get rid of using shr eax 4 and directly subtract 16 bytes on each loop
        ; On this way we get a bit more of speed while keeping performance. Interesting is that add eax 0-16 seems a bit faster then sub eax 16
        .Do
            movdqu xmm0 X$esi
            movdqu xmm1 X$edi
            pcmpeqd xmm0 xmm1
            movmskps edx xmm0
            If edx <> 0F
                xor eax eax | jmp L4>>
            End_If
            add esi 16
            add edi 16
            add eax 0-16 ; subtract by 16 bytes. Faster then sub
        .Repeat_Until_Zero
        mov edx D@Reminder
    .Test_End

    .Test_If_Not_Zero edx

        ; Now becomes the trick. Here we will calculate the remainders of our previous routine stored in edx
        ; So, assuming our initial value was 43, we now have only 11 as remainder, which is 2 Dwords (8 bytes) + 3 bytes
        ; So we have 2 main blocks to check. One containing only the dwords checking. So, a value whose range is 0 to 12 (multiple of 4 only, so, 0, 4, 8, 12
        ; and other related to the final bytes (3), whose values are a maximum of 3, so 0, 1, 2, 3
        ; We will store the Dwords computation in dl and the Byte computation in dh
        ; This way is faster then we use shr eax 2 or something to divide the dword by 4 and later compute the other byte remainders

        mov dh dl
        and dl 0-4 ; dl = our Dword computation. So, values multiples of 4 only, on a maximum of 12
        xor dh dl ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
        Test_If_Not_Zero dl
            ; If dl = 0, so if we have no dwords remainder, jmp over to compute the byte remainders
            mov eax D$esi | cmp eax D$edi | jz L1> | xor eax eax | jmp L4> | L1: | add esi 4 | add edi 4 | add dl 0-4 | jz L2>
            mov eax D$esi | cmp eax D$edi | jz L1> | xor eax eax | jmp L4> | L1: | add esi 4 | add edi 4 | add dl 0-4 | jz L2>
            mov eax D$esi | cmp eax D$edi | jz L1> | xor eax eax | jmp L4> | L1: | add esi 4 | add edi 4
        Test_End
        L2:

        ; check if we have no byte remainders
        Test_If_Not_Zero dh ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
            mov eax 0 ; surprisinly a bit faster then xor eax eax
            mov dl B$esi | cmp dl B$edi | jnz L4> | dec dh | jz L2>
            mov dl B$esi+1 | cmp dl B$edi+1 | jnz L4> | dec dh | jz L2>
            mov dl B$esi+2 | cmp dl B$edi+2 | jnz L4>
            L2:
        Test_End
    .Test_End
    mov eax D@pBuffer1
L4:

EndP

Macros used:

Code Select


;;
    SSE2_EXPAND_BYTE and SSE2_EXPAND_BYTE_FAST

        These macros replicates a single byte from a 32-bit register across all 16 bytes (128 bits) of an XMM register.
        They are similar to _mm_set1_epi8() in SSE2.

    SSE2_EXPAND_BYTE        Uses only SSE2 registers. So it preserves normal x86 registers while working.
    SSE2_EXPAND_BYTE_FAST   performs a similar task but uses an alternate method to replicate the byte by multiplying it by 0x01010101
                            before moving it to the XMM register. The only difference is that it changes the value of edx. So, if you intend
                            to use it, make sure to preserve edx before using it, in case you need edx in other parts of your code.

    SSE2_EXPAND_BYTE
        Explanation and Usage:
            movd #1, #2:        Move the 32-bit value in the normal register (#2) into the lower 32 bits of the XMM register (#1).
            punpcklbw #1, #1:   Unpack the bytes in the lower part of the XMM register into words, filling all 16 bits with the original byte.
            punpcklwd #1, #1:   Unpack the words into double words, expanding the byte across more space.
            pshufd #1, #1, 0:   Shuffle the four double words to fill all 16 bytes of the XMM register with the same byte.

        Example of usage:

            SSE2_EXPAND_BYTE xmm0 eax ; expands the byte in eax to fill xmm0.

    SSE2_EXPAND_BYTE_FAST
        Explanation and Usage:
            imul #2, #2, 01010101:          Multiply the value in the 32-bit register by 0x01010101, replicating the byte across the register.
            movd #1, #2:                    Move the replicated byte from the normal register (#2) to the XMM register (#1).
            pshufd #1, #1, SSE_FILL_DWORDS: Shuffle the double words to fill all 16 bytes in the XMM register with the byte.

        [SSE_FILL_DWORDS 0] ; equate used for shuffling XMM registers to fill double words (pshufd shuffle mask).

;;

[SSE_FILL_DWORDS 0]
[SSE2_EXPAND_BYTE | movd #1 #2 | punpcklbw #1 #1 | punpcklwd #1 #1 | pshufd #1 #1 0]
[SSE2_EXPAND_BYTE2 | imul #2 #2 01010101 | movd #1 #2 | pshufd #1 #1 SSE_FILL_DWORDS]

;;
    CLEAR_LEFTMOST_SET16
    This macro clears the left most distance bit on a 16 bit data.
    Parameters
      #1 (in/out)- A word size data or registers. Since the macro uses ax to compute, don´t use ax registers as input.

    Return value:
        The macro will return in the input the converted 16bit value. Also it will set the zero flag if it results 0

    Remarks:
        The macro uses ax register internaly. So on input you cannot use ax or you will get a incorrect value


    Explanation:
        mov ax, #1 – Move the value of the 16-bit register (like bx, cx, etc.) into ax.
        add #1, 0-1 – Subtract 1 from the value of the 16-bit register.
        and #1, ax – Perform a bitwise AND between the original value and the decremented value (after subtracting 1).


    Example of usage:

        CLEAR_LEFTMOST_SET16 bx

        Will unroll to: mov ax bx | add bx 0-1 | and bx ax
;;

[CLEAR_LEFTMOST_SET16 | mov ax #1 | add #1 0-1 | and #1 ax]

;;
    GET_FIRST_BIT_SET
    
    This macro locates the position of the first (lowest) set bit in the given register.

    Macro Code:
        [GET_FIRST_BIT_SET | bsf eax, #1]

    Explanation and Usage:

        bsf eax, #1: Bit Scan Forward—find the position of the first set bit in #1 and store the result in eax.

    Finds the position of the first set bit (least significant bit) in the given register.
    Example: GET_FIRST_BIT_SET ebx ; stores the position of the first set bit in ebx into eax. 
;;


[GET_FIRST_BIT_SET | bsf eax #1]

;;
    GET_LAST_BIT_SET

        This macro locates the position of the last (highest) set bit in the given register.

    Macro Code:

        [GET_LAST_BIT_SET | bsr eax, #1]

    Explanation and Usage:

        bsr eax, #1: Bit Scan Reverse—find the position of the last set bit in #1 and store the result in eax.

    Finds the position of the last set bit (most significant bit) in the given register.
    Example: GET_LAST_BIT_SET ebx ; stores the position of the last set bit in ebx into eax. 
;;

[GET_LAST_BIT_SET | bsr eax #1]

Title: Re: Instring
Post by: jj2007 on September 19, 2024, 05:52:02 PM

Hi Guga,

Attached the source. There is a problem with the parameters, though. Are you sure the sub esp, 4 is correct?

Jochen

Title: Re: Instring
Post by: guga on September 20, 2024, 12:34:19 AM

Hi JJ

tks you a lot.

Yes, i´m sure.

The problem you found is because you called the internal function and not the main function.

The main function to use is: SIMD_BinaryScan (Uses 4 parameters)
While SIMD_ComparePattern (Uses 3 parameters) is used only inside the SIMD_BinaryScan without preserving any registers (That are reused by the main function) and cannot be used alone (outside that function). The SIMD_ComparePattern function is a adaptation i made for the previous memcmp_SSE2 i posted here https://masm32.com/board/index.php?topic=12256.msg133369#msg133369

Replace this:

Code Select

NameA equ Guga compare pattern    ; assign a descriptive name for each test
TestA proc
  mov ebx, AlgoLoops-1    ; loop e.g. 100x
  align 4
  .Repeat
    push 4
    push chr$("zero")
    push 100
    push offset somestring
    call SIMD_ComparePattern
    dec ebx
  .Until Sign?
  movd xmm0, eax
  ret

with this:

Code Select

NameA equ Guga compare pattern    ; assign a descriptive name for each test
TestA proc
  mov ebx, AlgoLoops-1    ; loop e.g. 100x
  align 4
  .Repeat
    push 4
    push chr$("zero")
    push 100
    push offset somestring
    call SIMD_BinaryScan
    dec ebx
  .Until Sign?
  movd xmm0, eax
  ret

here are my results:

AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)

5658 cycles for 100 * Guga compare pattern
8070 cycles for 100 * Masm32 len

5613 cycles for 100 * Guga compare pattern
8264 cycles for 100 * Masm32 len

5630 cycles for 100 * Guga compare pattern
8001 cycles for 100 * Masm32 len

5585 cycles for 100 * Guga compare pattern
8118 cycles for 100 * Masm32 len

5765 cycles for 100 * Guga compare pattern
8018 cycles for 100 * Masm32 len

535 bytes for Guga compare pattern
16 bytes for Masm32 len

4202622 = eax Guga compare pattern
100 = eax Masm32 len

--- ok ---

Btw, on your code, this:
push chr$("zero")
is the same as
push offset Somestring2 ; where SomeString2 = db "zero", 0 right ?

Can u please try also with a longer pattern? I mean, longer than 16 bytes and somestring be around 50 Kb (or something). I would like to see how the function behave on a larger dataset and a larger pattern

Title: Re: Instring
Post by: guga on September 20, 2024, 02:22:08 AM

Btw, if the speed is Ok and we have no more room for optimization (on SSE2), i´ll try to make another version that work backwards (from bottom to top). Those routines can be handy for a faster string and replace algorithm (Those that includes the direction to search, match cases, whole words, ignore chars etc)

Note:

I know that we could speed it a little bit in SIMD_BinaryScan making the 1st loop advance 1 byte and decreasing the pattern size by 2, but if i do this, the app won´t work if the pattern is only 1 byte, and in order to fix it i´ll have to do a work around specifically for patterns of only 1 char, which, at the end will make useless trying to gain more 8 to 10% of speed (because it will slow again, if i had to make a workaround). Example,on the 1st loop (when the pattern is bigger or equal to 16 bytes, we could simply do this:

Code Select


    mov eax esi | add eax D@PatternSize
    movzx eax B$eax-1 | SSE2_EXPAND_BYTE_FAST xmm3 eax ; xmm3 = extended last byte of the string xmm3 = Var_Last

    mov esi D@pDataBuffer | mov eax esi | add eax D@DataSize | sub eax 16 | mov D@StringEnd eax
    ..Do
(....)

                ; get first bit_set
                GET_FIRST_BIT_SET ebx | mov D@BitPos eax
                lea edi D$esi+eax+1
                call SIMD_ComparePattern edi, D@pSearchPattern2, D@PatternSize2

; where D@pSearchPattern2 = D@pSearchPattern + 1 byte. SO, points to the 2nd byte in pSearchPattern
; where D@PatternSize2 = D@PatternSize - 2 bytes

The above piece of code i tested before, and indeed i gained around 8 to 10% of speed, because it will compare a smaller pattern (It will compare the pattern, except the 1st and last byte, since it already matched). The problem is when the pattern is only 1 byte long (single char). So, i choose to keep the comparison on the full pattern, and remove all register preservations inside SIMD_ComparePattern (except for esi).Removing the preservation of registers (normal x86 and xmm) improved the speed, even though it could still be a bit more faster if i compared with a smaller pattern.

Text only | Text with Images

SMF 2.1.6 © 2025, Simple Machines