News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

Instring

Started by guga, August 04, 2023, 02:30:50 AM

Previous topic - Next topic

jj2007

Quote from: Biterider on August 06, 2023, 08:52:20 PMAs it stands now, I strongly recommend not using these implementations for general purposes, as this will lead to unexplained and random crashes in customer code, resulting in a sh*storm.

I've used them for over a year in all my code, no crashes. HeapAlloc'ed strings and buffers are not a problem, in contrast to this:
  invoke VirtualAlloc, 0, 4096, MEM_COMMIT, PAGE_READWRITE
  xchg eax, edi
  Let esi=FileRead$("\Masm32\include\Windows.inc")
  invoke lstrcpyn, edi, esi, 4096 ; esi is a pointer e.g. to contents of Windows.inc
  Let sub2$="ACMFILTERCHOOSEHOOKPROC     typ" ; "typ" is ok, "type" fails with an exception
  Print Str$("The pos is %i\n", Instr_(edi, sub2$))

The solution is SEH.

Biterider

Hi JJ
It is only a matter of time before this code bites you.  :badgrin:

If the SEH handler has to catch this situation, what should it do? Fall back on the regular implementation?
Thinking about that, you have to protect every single call to these new procedures...
A higher level handler is also possible, but the complexity does not justify it.

Another side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.

Still not a good solution for production code...

Biterider

jj2007

Quote from: Biterider on August 06, 2023, 10:29:37 PMIf the SEH handler has to catch this situation, what should it do? Fall back on the regular implementation?

Return zero = not found.

QuoteThinking about that, you have to protect every single call to these new procedures...

Once, at program start.

QuoteAnother side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.

Nope, the timings are exactly the same if a match is found. If no match is found and the SEH is triggered, then you have a design problem.

  invoke VirtualAlloc, 0, 4096, MEM_COMMIT, PAGE_READWRITE
  xchg eax, edi
  Let esi=FileRead$("\Masm32\include\Windows.inc")
  invoke lstrcpyn, edi, esi, 4096 ; esi is a pointer e.g. to contents of Windows.inc
  Let sub32$="ACMFILTERCHOOSEHOOKPROC     type" ; type fails with an exception, "typ" is ok
  Let sub31$=Left$(sub32$, 31)
  Print Str$("For 31 chars, the pos is %i\n", Instr_(edi, sub31$))
  Print Str$("For 32 chars, the pos is %i\n", Instr_(edi, sub32$))
  Inkey "That was fun with SEH"

Output:
For 31 chars, the pos is 4058
Caught at pos:  $esi            def DW
For 32 chars, the pos is 0
That was fun with SEH

Biterider

#18
Quote from: jj2007 on August 06, 2023, 11:27:25 PM
QuoteIf the SEH handler has to catch this situation, what should it do? Fall back on the regular implementation?

Return zero = not found.

It is obviously wrong to always return 0 when you have a match, which leads to wrong results.

Quote from: jj2007 on August 06, 2023, 11:27:25 PM
QuoteThinking about that, you have to protect every single call to these new procedures...

Once, at program start.

You are talking about a TL-handler that has to deal with all kinds of exceptions coming from somewhere in your application.
This means that you have to analyse exactly where the exception occurred and what went wrong, and then unwind the whole thing to the offending procedure, which is a leaf procedure.

Quote from: jj2007 on August 06, 2023, 11:27:25 PM
QuoteAnother side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.

Nope, the timings are exactly the same if a match is found. If no match is found and the SEH is triggered, then you have a design problem.

You misinterpreted my comment, I said when an exception is triggered. That comes from the transitions from user-land to ring 0 and vice-versa.

The whole thing is not about SEH, it is about code quality degradation.  :nie:
It is expected that when a proc that works perfectly fine is replaced by a better/faster version that it returns exactly the same results.
It makes no sense to replace it with a faster version that may work 99.9% but fails under some conditions the user has no control over.  :eusa_naughty:

For sure there are cases where a wrong result can be tolerated, but imagine you use this code in a critical process and this super fast routine fails 1/1000 times crashing a machine/car/rocket.
No one in their right mind would use it as a general purpose routine for a few microseconds that don't matter in the end.

I don't want to hurt anyone's feelings, but wouldn't it be better to find a better and suitable solution?  :wink2:

Biterider

jj2007

Hi Biterider, it seems I edited your post and made a mess out of it - apologies.


QuoteIt is obviously wrong to always return 0 when you have a match, which leads to wrong results.

Describe a situation where an exception was triggered but there still was a match.

QuoteYou are talking about a TL-handler that has to deal with all kinds of exceptions coming from somewhere in your application.
This means that you have to analyse exactly where the exception occurred and what went wrong, and then unwind the whole thing to the offending procedure, which is a leaf procedure.

You didn't check the attachment posted above.

Quote
Quote from: jj2007 on August 06, 2023, 11:27:25 PM
QuoteAnother side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.

Nope, the timings are exactly the same if a match is found. If no match is found and the SEH is triggered, then you have a design problem.

You misinterpreted my comment, I said when an exception is triggered. That comes from the transitions from user-land to ring 0 and vice-versa.

Yes, we are wasting a millisecond or so in the extremely unlikely event of an exception.

Biterider

Quote from: jj2007 on August 07, 2023, 06:11:13 AMDescribe a situation where an exception was triggered but there still was a match.
Suppose you have an ANSI zero terminated string "abcABC" at the end of a page and the next page is not commited.
Now you start searching for "ABC" at a 16 byte unaligned address. You overshoot the page boundary by 9 bytes raising an exception and you should return a match result.

??? ? ? ? ? ? ? ? ? ? ? ? ? ?
? ? ? ? ? ? ? ? ? abcABC0
XXXXX X X X X X X X X X X X
X X X X X X X X X X X X X X X X

Biterider


jj2007

Well, you have a point there :thumbsup:

It's not so relevant for MasmBasic, because strings and buffers are all HeapAlloc'ed with a little reserve of 24 bytes that helps to avoid this scenario.

See Peter Cordes' very detailed description of pcmpistri & pcmpestri performance at SOF

I have a version that checks the bounds, and it's only a tick slower than the fast version, but I really don't think it's worth the effort. Real life strings and buffers should never end at no man's land.

Btw this logic should be applied to all code. Attached an example that shows the behaviour of the Masm32 SDK len() macro. You might be surprised :cool:

guga

Hi, JJ, i´m trying to create a benchmark test on a new BinarySearch/Instring function i did (specifically with SSE2 only), but i couldn´t be able to implement it on your benchmark app to test it. Can, u please input it on your bench algo so we can test the speed of it ?

It is designed to work with any pattern size. So, it can handle strings (or data chunks) from 1 to XXX bytes. Since i focused on speed, i didn´t implemented on the function an error check in case the pattern is smaller than the dataset or check if the size of the data/pattern are zeroed etc. The size of the dataset must be bigger or equal to 16 bytes long

I attached the masm compatible source code (the same syntax i used when testing on your benchmark app).

If you need explanations on usage here is The RosAsm code (With explanations of how the function works):

Main function SIMD_BinaryScan

;;
        SIMD_BinaryScan
            This function performs a binary search for a chunk of data (pSearchPattern) within another larger dataset (pDataBuffer).
            It uses SSE2 instructions for optimization and speed, especially with large data sets.
            The function can be used to find data from any size, including strings (case sensitive) on a big text

        Parameters:

            pDataBuffer (in):       Pointer to the start of the larger dataset in which the search is performed.
                                    This is the buffer in which you are searching.
            DataSize (in):          The size (in bytes) of the Data.
            pSearchPattern (in):    Pointer to the start of the Search Pattern (the data chunk to search for within the DataSet).
                                    This is the smaller data chunk you are searching for.
            PatternSize (in):       The size (in bytes) of the SearchPattern.


        Return Value:

            Success: If the Pattern is found, the function returns a pointer to the starting position of the matching chunk in the Dataset.
            Failure: Returns 0 (&NULL) if the Pattern is not found in the Dataset.

        Remarks:
            Main Function Flow:

            1 - Initialize First and Last Byte of the Pattern:
                The first and last bytes of the Pattern are expanded into SSE registers xmm2 and xmm3 for comparison against blocks of the DataSet.

            2 - Loop through Dataset Blocks:
                Each block of 16 bytes from the Dataset is loaded into xmm0 and xmm1 to compare the first and last bytes with the corresponding bytes of the pattern.
                SSE2 instructions (pcmpeqb, pand, pmovmskb) are used to generate masks and compare these blocks efficiently.

            3 - Check for Match:
                If matching bits are found, the position of the first matching byte is retrieved, and a detailed byte-by-byte comparison is done with
                SIMD_ComparePattern to confirm the match.
                If no match is found, the loop continues with the next block of data in the Dataset.

            4 - Handle Remainder:
                If the Dataset's size is not a multiple of 16, any remaining bytes are handled separately at the end of the Dataset.

            5 - Return Result:
                If the pattern is found, the function returns the position of the match; otherwise, it returns 0 (no match).

        Example of usage:
       
        [haystack: B$ "This is a simples string to search within. While trying to find sinp from sinples string", 0]
        [needle: B$ "search wi", 0]

            call 'FastCRT.StrLen' haystack
            mov ebx eax
            call 'FastCRT.StrLen' needle

            call SIMD_BinaryScan haystack, ebx, needle, eax
;;

Proc SIMD_BinaryScan:
    Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
    Local @BitPos, @StringEnd, @MaxPos
    Structure @XmmPreserve 80, @XMMReg0Dis 0, @XMMReg1Dis 16, @XMMReg2Dis 32, @XMMReg3Dis 48, @XMMReg4Dis 64
    Uses esi, ebx, edx, edi

    ; save the contents xmm registers to avoid altering them
    lea eax D@XMMReg0Dis | movdqu X$eax XMM0
    lea eax D@XMMReg1Dis | movdqu X$eax XMM1
    lea eax D@XMMReg2Dis | movdqu X$eax XMM2
    lea eax D@XMMReg3Dis | movdqu X$eax XMM3

    ; 1st get pointer to the string/data to be found
    mov esi D@pSearchPattern
    ; get the 1st char of the
    movzx eax B$esi | SSE2_EXPAND_BYTE2 xmm2 eax ; xmm2 = extended 1st byte of the string xmm2 = Var_First

    mov eax esi | add eax D@PatternSize
    movzx eax B$eax-1 | SSE2_EXPAND_BYTE2 xmm3 eax ; xmm3 = extended last byte of the string xmm3 = Var_Last

    mov esi D@pDataBuffer | mov eax esi | add eax D@DataSize | sub eax 16 | mov D@StringEnd eax
    ..Do
        movups XMM0 X$esi; xmm0 = Var_BlockFirst.
        mov eax D@PatternSize
        movups XMM1 X$esi+eax-1; xmm1 = Var_BlockLast.

        ; we need to create a mask formed by the comparition of the last bytes.
        ; Var_EqFirst = comparition between Var_First and Var_BlockFirst. The result is stored in xmm0
        pcmpeqb xmm0 xmm2   ; compare Var_First with previous Var_BlockFirst = Var_EqFirst

        ; Var_EqLast = comparition between Var_Last and Var_BlockLast. The result is stored in xmm1
        pcmpeqb xmm1 xmm3   ; compare Var_Last with previous Var_BlockLast = Var_EqLast

        pand xmm1 xmm0      ; Now we simply need to and those masks (Var_EqFirst and Var_EqLast). We store in xmm1 to preserve xmm0 to reuse below
        pmovmskb ebx XMM1   ; And finally create that mask and copy it onto eax. Maximum value = 0FFFF
        .Test_If_Not_Zero ebx
            Do
                ; get first bit_set
                GET_FIRST_BIT_SET ebx | mov D@BitPos eax
                lea edi D$esi+eax
                call SIMD_ComparePattern edi, D@pSearchPattern, D@PatternSize
                cmp eax 0 | jnz L2>>
               
                ;bsr ax bx | btr bx ax | jc L1< ; in practise, the carry flag will be settled only when ebx (bx) is not zero. So, the loop will works whenever ebx <> 0
                ; clear leftmost bit_set
                CLEAR_LEFTMOST_SET16 bx
            Repeat_Until_Zero
            ; now we store the last known bitpos where the string was not matched,
            ; and add it to the string size to get the next position to search
            mov eax D@BitPos | add eax D@PatternSize
        .Test_Else
            pmovmskb eax xmm0 ; And finally create that mask and copy it onto eax. Maximum value = 0FFFF
            Test_If_Not_Zero eax
                ; get last bit_set
                GET_LAST_BIT_SET eax
                inc eax ; The last pair of bits didn´ matched. Therefore, we need to increase only by 1 to do to the next position
            Test_Else
                mov eax 16
            Test_End
        .Test_End
        add esi eax
    ..Loop_Until esi > D@StringEnd

    ; Now handle remainder
    .If D@PatternSize <= 16
        mov esi D@StringEnd
        movups XMM0 X$esi; xmm0 = Var_BlockFirst.
        ; we need to create a mask formed by the comparition of the last bytes.
        ; Var_EqFirst = comparition between Var_First and Var_BlockFirst. The result is stored in xmm0
        pcmpeqb xmm0 xmm2; compare Var_First with previous Var_BlockFirst = Var_EqFirst
        pmovmskb ebx XMM0 ; Create that mask and copy it onto eax. Maximum value = 0FFFF
        .Test_If_Not_Zero ebx
            mov eax 16 | sub eax D@PatternSize | mov D@MaxPos eax
            Do
                ; get first bit_set
                GET_FIRST_BIT_SET ebx; | mov D@BitPos eax
                cmp eax D@MaxPos | jnbe L1> ; exit if data is not below or equal (So, jump if bigger than ...)
                lea edi D$esi+eax
                call SIMD_ComparePattern edi, D@pSearchPattern, D@PatternSize
                cmp eax 0 | jnz L2>
                 ; clear leftmost bit_set
                CLEAR_LEFTMOST_SET16 bx
            Repeat_Until_Zero
        .Test_End
    .End_If
L1:

    xor eax eax

L2:

    lea esi D@XMMReg0Dis | movdqu XMM0 X$esi
    lea edi D@XMMReg1Dis | movdqu XMM1 X$edi
    lea esi D@XMMReg2Dis | movdqu XMM2 X$esi
    lea edi D@XMMReg3Dis | movdqu XMM3 X$edi

EndP



Internal byte compare function

;;
    SIMD_ComparePattern

    This function, performs a detailed byte-by-byte comparison of two buffers. It is called once a potential
    match is found by the SSE2 block comparison.

    Parameters:

        pBuffer1 (in):  Pointer to the block in the haystack (loaded in esi).
        pBuffer2 (in):  Pointer to the needle (loaded in edi).
        length (in):    Length (in bytes) of the buffers to be compared.

    Return Value:

        Success: Returns the pointer to pBuffer1 if the contents match exactly.
        Failure: Returns 0 if the contents do not match.

    Remarks:
        This function is an adaptation of memcmp_SSE2, but to be used only inside SIMD_BinaryScan.
        It is also developed for speed, for this reason it don't preserve the registers (except esi)
        that are being used in SIMD_BinaryScan.

;;

Proc SIMD_ComparePattern:
    Arguments @pBuffer1, @pBuffer2, @Lenght
    Local @Reminder
    Uses esi; No need to preserve edi and others registers too since this function is used only inside SIMD_BinaryScan.

    ; Step1 - Check if the lenght is zeroed. If lenght is 0, jump over the whole function
    mov edx D@Lenght
    mov esi D@pBuffer1
    mov edi D@pBuffer2
    and edx 0-16 | mov eax edx | xor edx D@Lenght ; When 0 means lenght is divisible by 16 and we have no remainders, jmp over to the main function

    ; Step2
    .Test_If_Not_Zero eax ; edx = 0 ? no remainders exit
        mov D@Reminder edx
        ; We are working with multiples of 16 bytes, therefore, we can get rid of using shr eax 4 and directly subtract 16 bytes on each loop
        ; On this way we get a bit more of speed while keeping performance. Interesting is that add eax 0-16 seems a bit faster then sub eax 16
        .Do
            movdqu xmm0 X$esi
            movdqu xmm1 X$edi
            pcmpeqd xmm0 xmm1
            movmskps edx xmm0
            If edx <> 0F
                xor eax eax | jmp L4>>
            End_If
            add esi 16
            add edi 16
            add eax 0-16 ; subtract by 16 bytes. Faster then sub
        .Repeat_Until_Zero
        mov edx D@Reminder
    .Test_End

    .Test_If_Not_Zero edx

        ; Now becomes the trick. Here we will calculate the remainders of our previous routine stored in edx
        ; So, assuming our initial value was 43, we now have only 11 as remainder, which is 2 Dwords (8 bytes) + 3 bytes
        ; So we have 2 main blocks to check. One containing only the dwords checking. So, a value whose range is 0 to 12 (multiple of 4 only, so, 0, 4, 8, 12
        ; and other related to the final bytes (3), whose values are a maximum of 3, so 0, 1, 2, 3
        ; We will store the Dwords computation in dl and the Byte computation in dh
        ; This way is faster then we use shr eax 2 or something to divide the dword by 4 and later compute the other byte remainders

        mov dh dl
        and dl 0-4 ; dl = our Dword computation. So, values multiples of 4 only, on a maximum of 12
        xor dh dl ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
        Test_If_Not_Zero dl
            ; If dl = 0, so if we have no dwords remainder, jmp over to compute the byte remainders
            mov eax D$esi | cmp eax D$edi | jz L1> | xor eax eax | jmp L4> | L1: | add esi 4 | add edi 4 | add dl 0-4 | jz L2>
            mov eax D$esi | cmp eax D$edi | jz L1> | xor eax eax | jmp L4> | L1: | add esi 4 | add edi 4 | add dl 0-4 | jz L2>
            mov eax D$esi | cmp eax D$edi | jz L1> | xor eax eax | jmp L4> | L1: | add esi 4 | add edi 4
        Test_End
        L2:

        ; check if we have no byte remainders
        Test_If_Not_Zero dh ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
            mov eax 0 ; surprisinly a bit faster then xor eax eax
            mov dl B$esi | cmp dl B$edi | jnz L4> | dec dh | jz L2>
            mov dl B$esi+1 | cmp dl B$edi+1 | jnz L4> | dec dh | jz L2>
            mov dl B$esi+2 | cmp dl B$edi+2 | jnz L4>
            L2:
        Test_End
    .Test_End
    mov eax D@pBuffer1
L4:

EndP


Macros used:

;;
    SSE2_EXPAND_BYTE and SSE2_EXPAND_BYTE_FAST

        These macros replicates a single byte from a 32-bit register across all 16 bytes (128 bits) of an XMM register.
        They are similar to _mm_set1_epi8() in SSE2.

    SSE2_EXPAND_BYTE        Uses only SSE2 registers. So it preserves normal x86 registers while working.
    SSE2_EXPAND_BYTE_FAST   performs a similar task but uses an alternate method to replicate the byte by multiplying it by 0x01010101
                            before moving it to the XMM register. The only difference is that it changes the value of edx. So, if you intend
                            to use it, make sure to preserve edx before using it, in case you need edx in other parts of your code.

    SSE2_EXPAND_BYTE
        Explanation and Usage:
            movd #1, #2:        Move the 32-bit value in the normal register (#2) into the lower 32 bits of the XMM register (#1).
            punpcklbw #1, #1:   Unpack the bytes in the lower part of the XMM register into words, filling all 16 bits with the original byte.
            punpcklwd #1, #1:   Unpack the words into double words, expanding the byte across more space.
            pshufd #1, #1, 0:   Shuffle the four double words to fill all 16 bytes of the XMM register with the same byte.

        Example of usage:

            SSE2_EXPAND_BYTE xmm0 eax ; expands the byte in eax to fill xmm0.

    SSE2_EXPAND_BYTE_FAST
        Explanation and Usage:
            imul #2, #2, 01010101:          Multiply the value in the 32-bit register by 0x01010101, replicating the byte across the register.
            movd #1, #2:                    Move the replicated byte from the normal register (#2) to the XMM register (#1).
            pshufd #1, #1, SSE_FILL_DWORDS: Shuffle the double words to fill all 16 bytes in the XMM register with the byte.

        [SSE_FILL_DWORDS 0] ; equate used for shuffling XMM registers to fill double words (pshufd shuffle mask).

;;

[SSE_FILL_DWORDS 0]
[SSE2_EXPAND_BYTE | movd #1 #2 | punpcklbw #1 #1 | punpcklwd #1 #1 | pshufd #1 #1 0]
[SSE2_EXPAND_BYTE2 | imul #2 #2 01010101 | movd #1 #2 | pshufd #1 #1 SSE_FILL_DWORDS]

;;
    CLEAR_LEFTMOST_SET16
    This macro clears the left most distance bit on a 16 bit data.
    Parameters
      #1 (in/out)- A word size data or registers. Since the macro uses ax to compute, don´t use ax registers as input.

    Return value:
        The macro will return in the input the converted 16bit value. Also it will set the zero flag if it results 0

    Remarks:
        The macro uses ax register internaly. So on input you cannot use ax or you will get a incorrect value


    Explanation:
        mov ax, #1 – Move the value of the 16-bit register (like bx, cx, etc.) into ax.
        add #1, 0-1 – Subtract 1 from the value of the 16-bit register.
        and #1, ax – Perform a bitwise AND between the original value and the decremented value (after subtracting 1).


    Example of usage:

        CLEAR_LEFTMOST_SET16 bx

        Will unroll to: mov ax bx | add bx 0-1 | and bx ax
;;

[CLEAR_LEFTMOST_SET16 | mov ax #1 | add #1 0-1 | and #1 ax]

;;
    GET_FIRST_BIT_SET
   
    This macro locates the position of the first (lowest) set bit in the given register.

    Macro Code:
        [GET_FIRST_BIT_SET | bsf eax, #1]

    Explanation and Usage:

        bsf eax, #1: Bit Scan Forward—find the position of the first set bit in #1 and store the result in eax.

    Finds the position of the first set bit (least significant bit) in the given register.
    Example: GET_FIRST_BIT_SET ebx ; stores the position of the first set bit in ebx into eax.
;;


[GET_FIRST_BIT_SET | bsf eax #1]

;;
    GET_LAST_BIT_SET

        This macro locates the position of the last (highest) set bit in the given register.

    Macro Code:

        [GET_LAST_BIT_SET | bsr eax, #1]

    Explanation and Usage:

        bsr eax, #1: Bit Scan Reverse—find the position of the last set bit in #1 and store the result in eax.

    Finds the position of the last set bit (most significant bit) in the given register.
    Example: GET_LAST_BIT_SET ebx ; stores the position of the last set bit in ebx into eax.
;;

[GET_LAST_BIT_SET | bsr eax #1]
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

jj2007

Hi Guga,

Attached the source. There is a problem with the parameters, though. Are you sure the sub esp, 4 is correct?

Jochen

guga

#24
Hi JJ

tks you a lot.

Yes, i´m sure.

The problem you found is because you called the internal function and not the main function.

The main function to use is: SIMD_BinaryScan (Uses 4 parameters)
While SIMD_ComparePattern (Uses 3 parameters) is used only inside the SIMD_BinaryScan without preserving any registers (That are reused by the main function) and cannot be used alone (outside that function). The SIMD_ComparePattern function is a adaptation i made for the previous memcmp_SSE2 i posted here https://masm32.com/board/index.php?topic=12256.msg133369#msg133369

Replace this:

NameA equ Guga compare pattern    ; assign a descriptive name for each test
TestA proc
  mov ebx, AlgoLoops-1    ; loop e.g. 100x
  align 4
  .Repeat
    push 4
    push chr$("zero")
    push 100
    push offset somestring
    call SIMD_ComparePattern
    dec ebx
  .Until Sign?
  movd xmm0, eax
  ret


with this:

NameA equ Guga compare pattern    ; assign a descriptive name for each test
TestA proc
  mov ebx, AlgoLoops-1    ; loop e.g. 100x
  align 4
  .Repeat
    push 4
    push chr$("zero")
    push 100
    push offset somestring
    call SIMD_BinaryScan
    dec ebx
  .Until Sign?
  movd xmm0, eax
  ret


here are my results:

AMD Ryzen 5 2400G with Radeon Vega Graphics    (SSE4)

5658    cycles for 100 * Guga compare pattern
8070    cycles for 100 * Masm32 len

5613    cycles for 100 * Guga compare pattern
8264    cycles for 100 * Masm32 len

5630    cycles for 100 * Guga compare pattern
8001    cycles for 100 * Masm32 len

5585    cycles for 100 * Guga compare pattern
8118    cycles for 100 * Masm32 len

5765    cycles for 100 * Guga compare pattern
8018    cycles for 100 * Masm32 len

535    bytes for Guga compare pattern
16      bytes for Masm32 len

4202622 = eax Guga compare pattern
100    = eax Masm32 len

--- ok ---

Btw, on your code, this:
push chr$("zero")
is the same as
push offset Somestring2 ; where SomeString2 = db "zero", 0  right ?


Can u please try also with a longer pattern? I mean, longer than 16 bytes and somestring be around 50 Kb (or something). I would like to see how the function behave on a larger dataset and a larger pattern
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

#25
Btw, if the speed is Ok and we have no more room for optimization (on SSE2), i´ll try to make another version that work backwards (from bottom to top). Those routines can be handy for a faster string and replace algorithm (Those that includes the direction to search, match cases, whole words, ignore chars etc)

Note:

I know that we could speed it a little bit in SIMD_BinaryScan making the 1st loop advance 1 byte and decreasing the pattern size by 2, but if i do this, the app won´t work if the pattern is only 1 byte, and in order to fix it i´ll have to do a work around specifically for patterns of only 1 char, which, at the end will make useless trying to gain more 8 to 10% of speed (because it will slow again, if i had to make a workaround). Example,on the 1st loop (when the pattern is bigger or equal to 16 bytes, we could simply do this:


    mov eax esi | add eax D@PatternSize
    movzx eax B$eax-1 | SSE2_EXPAND_BYTE_FAST xmm3 eax ; xmm3 = extended last byte of the string xmm3 = Var_Last

    mov esi D@pDataBuffer | mov eax esi | add eax D@DataSize | sub eax 16 | mov D@StringEnd eax
    ..Do
(....)

                ; get first bit_set
                GET_FIRST_BIT_SET ebx | mov D@BitPos eax
                lea edi D$esi+eax+1
                call SIMD_ComparePattern edi, D@pSearchPattern2, D@PatternSize2

; where D@pSearchPattern2 = D@pSearchPattern + 1 byte. SO, points to the 2nd byte in pSearchPattern
; where D@PatternSize2 = D@PatternSize - 2 bytes

The above piece of code i tested before, and indeed i gained around 8 to 10% of speed, because it will compare a smaller pattern (It will compare the pattern, except the 1st and last byte, since it already matched). The problem is when the pattern is only 1 byte long (single char). So, i choose to keep the comparison on the full pattern, and remove all register preservations inside SIMD_ComparePattern (except for esi).Removing the preservation of registers (normal x86 and xmm) improved the speed, even though it could still be a bit more faster if i compared with a smaller pattern.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com