Text only | Text with Images

The MASM Forum

General => The Laboratory => Topic started by: guga on March 17, 2025, 05:27:34 PM

Title: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 17, 2025, 05:27:34 PM

Hi Guys

Updated (24/03/2025):
Sometime ago i started creating a fast Instring function to be used as described at here (https://masm32.com/board/index.php?topic=11088.msg133567#msg133567)

The function is a pattern compare that later i´ll use it as a InString (IN fact, it can be used already as such). Recently i found small issue in a test i was doing where it mismatched one of the patterns, so one pattern was not found on a string, but it was there.

Can someone help me to show your results and also if one of the cases tested failed to identify the pattern ?

Updated Post:
RosAsm code

Code Select

Proc SIMD_BinaryScan:
    Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
    Local @FirstBitPos, @ScanLimit, @StringEnd, @pFunction
    Structure @XmmPreserve 80, @XMMReg0Dis 0, @XMMReg1Dis 16, @XMMReg2Dis 32, @XMMReg3Dis 48, @XMMReg4Dis 64
    Uses esi, ebx, edi, edx

    ; Preserve XMM registers efficiently
    movdqu X@XMMReg0Dis XMM0
    movdqu X@XMMReg1Dis XMM1
    movdqu X@XMMReg2Dis XMM2
    movdqu X@XMMReg3Dis XMM3

    mov esi D@DataSize

    ; edi = DataSize
    ; edx = Bytes for ScanLimit
    ; ebx = PatternSize
    ; Security: Pattern size cannot be bigger than data size, and it cannot be zero. Also we precalculate ScanLimit Bytes
    ...If_Or esi < D@PatternSize, D@PatternSize = 0
        mov eax 0-1 ; Return -1 for invalid input (PatternSize > DataSize or 0)
    ...Else

        ; Initialize pattern and buffer pointers
        mov eax D@pDataBuffer
        mov edi D@DataSize | lea eax D$eax+edi | mov D@StringEnd eax ; StringEnd = pDataBuffer + DataSize

        mov esi D@pSearchPattern
        mov ebx D@PatternSize

        ; Precompute pattern bytes for SIMD comparison
        movzx eax B$esi | SSE2_EXPAND_BYTE XMM2 eax                         ; First byte in XMM2
        lea eax D$esi+ebx-1 | movzx eax B$eax | SSE2_EXPAND_BYTE XMM3 eax   ; Last byte in XMM3

        ; Select verification function based on PatternSize
        mov D@pFunction ComparePattern16ByteBlocks  ; Default for PatternSize > 16
        ; Check if PatternSize >= 16 (shift by 4 = divide by 16)
        ; If >= 16, keep ComparePattern16ByteBlocks
        ; Otherwise, use byte/dword scalar comparison
        mov eax ebx | shr eax 4 | jnz L1> | mov D@pFunction ComparePatternBytes | L1:

        mov esi D@pDataBuffer

        ; Handle small datasets (< 32 bytes)
        ; edi = DataSize / 32
        ; If >= 32, proceed to main loop
        mov eax edi | shr edi 5 | jnz L1> ; (DataSize/32)
            ; eax = DataSize - PatternSize
            ; Check if remaining size fits 16-byte SSE register
            ; if data is smaller then 32 bytes we need to see if the pattern fits the SSe register size limit (16 bytes)
            sub eax ebx | shr eax 4 | jnz L1> ; (DataSize-PatternSize)/16 . If >= 16, use main loop
                call ScanSmallSSE2 esi, D@DataSize, D@pSearchPattern, D@PatternSize
                jmp L2>>
        L1:

        ; Compute ScanLimit for main loop (aligns to 16-byte boundary)

        ; Calculate ScanLimit
        mov eax D@DataSize      ; eax = DataSize
        and eax 0-16            ; eax = DataSize & 0xFFFFFFF0 . Align DataSize down to multiple of 16
        lea eax D$esi+eax+1     ; eax = esi + (DataSize & 0xFFFFFFF0) + 1 . ScanLimit = pDataBuffer + aligned DataSize + 1
        mov D@ScanLimit eax     ; Store result in ScanLimit

        ; Main loop: Process the dataset in blocks of 16 bytes. ebx = PatternSize. The register must not be changed
        .Do
            mov D@FirstBitPos 0
            movdqu XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
            movdqu XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.

            pcmpeqb xmm0 xmm2 ; Compare the first byte of the pattern with the first byte of the block.
            pcmpeqb xmm1 xmm3 ; Compare the last byte of the pattern with the last byte of the block.

            ; Combine the masks to find positions where both first and last bytes match.
            pand xmm1 xmm0      ; xmm1 = combined mask (Var_EqFirst AND Var_EqLast).
            ; Generate a bitmask from the combined mask.
            pmovmskb edx XMM1   ; ebx = bitmask of matching positions. Maximum value = 0FFFF

            ; Check if there are any matches.
            .Test_If_Not_Zero edx
                ; Process each matching position.
                Do
                    ; Find the first set bit in the bitmask.
                    GET_FIRST_BIT_SET edx | mov D@FirstBitPos eax
                    ; Calculate the address of the matching position.
                    lea edi D$esi+eax ; edi = potential match address
                    ; Perform a detailed comparison of the pattern at this position.
                    call D@pFunction edi, D@pSearchPattern, D@PatternSize
                    cmp eax 0 | jnz L2>>    ; If a match is found, return the position.
                    ; Clear the leftmost set bit in the bitmask and save the result in dx.
                    CLEAR_LEFTMOST_SET16 dx ; Clear processed bit
                Repeat_Until_Zero   ; perfom a loop if dx is not zero.
                ; Update the position to continue searching.
                mov eax D@FirstBitPos | add eax D@PatternSize   ; Advance past current pattern
            .Test_Else
                ; Check first byte matches to optimize step
                ; If no matches, find the last set bit in the mask and advance the pointer.
                pmovmskb eax xmm0 ; eax = bitmask of first byte matches. Maximum value = 0FFFF
                Test_If_Not_Zero eax
                    ; Find the last set bit.
                    GET_LAST_BIT_SET eax    ; Find last first-byte match
                    inc eax ; Advance to the next position.
                            ; We found at least 1 byte matching, increase the position of the last byte that matched by 1
                Test_Else
                    mov eax 16  ; No matches, step full 16 bytes
                Test_End
            .Test_End
            add esi eax ; Advance buffer pointer

            ; Check if remaining size is too small for pattern
            mov edi D@StringEnd | sub edi esi
            If edi < D@PatternSize
                xor eax eax | jmp L2>>
            End_If

            lea eax D$esi+15        ; Prepare next block check
        .Loop_Until eax > D@ScanLimit

        ; Handle tail (remaining bytes < 16)
        mov eax D@StringEnd | sub eax esi
        If eax => D@PatternSize
            call ScanSmallSSE2 esi, eax, D@pSearchPattern, D@PatternSize
        Else
            xor eax eax
        End_If

    ...End_If

L2:

    ; Restore XMM registers
    movdqu XMM0 X@XMMReg0Dis
    movdqu XMM1 X@XMMReg1Dis
    movdqu XMM2 X@XMMReg2Dis
    movdqu XMM3 X@XMMReg3Dis

EndP

Helper functions

Code Select

Proc ComparePattern16ByteBlocks:
    Arguments @pDataBuffer, @pSearchPattern, @PatternSize
    Local @Reminder
    Uses esi, edx

    ; Step 1: Check if the length is zero. If so, skip the function.
    mov edx D@PatternSize
    mov esi D@pDataBuffer
    mov edi D@pSearchPattern
    ; Compute multiples of 16 and remainder
    ;and edx 0-16 | mov eax edx | xor edx D@PatternSize ; When 0 means lenght is divisible by 16 and we have no remainders.
    mov eax edx
    and eax 0-16                ; eax = multiples of 16 (aligned down)
    and edx (16-1)              ; edx = remainder (0-15)

    ; Process 16-byte blocks with SIMD
    .Test_If_Not_Zero eax ; edx = 0 ? no remainders exit
        mov D@Reminder edx ; Preserve remainder
        ; We are working with multiples of 16 bytes, therefore, we can get rid of using shr eax 4 and directly subtract 16 bytes on each loop
        ; On this way we get a bit more of speed while keeping performance. Interesting is that add eax 0-16 seems a bit faster then sub eax 16
        .Do
            movdqu xmm0 X$esi   ; Load 16 bytes from data
            movdqu xmm1 X$edi   ; Load 16 bytes from pattern
            pcmpeqd xmm0 xmm1   ; Compare 4 dwords (16 bytes) at once
            movmskps edx xmm0   ; Extract 4-bit mask (1 per dword)
            cmp edx 0F | jnz @NoMatch ; All 4 dwords must match
            add esi 16
            add edi 16
            add eax 0-16 ; subtract by 16 bytes. Faster then sub eax 16
        .Repeat_Until_Zero
        mov edx D@Reminder ; Restore remainder
    .Test_End

    ; Handle remainder (0-15 bytes, if any). When eax is also 0, we skip this and exit the D@pDataBuffer.
    ; Why ? Because on the previous test, eax was never 0
    .Test_If_Not_Zero edx ; Check if edx is not 0
        ; Split PatternSize into dwords (dl) and bytes (dh)
        mov dh dl
        and dl 0-4 ; dl = our Dword computation. So, values multiples of 4 only, on a maximum of 12 ; dl = dwords * 4 (0, 4, 8, 12)
        xor dh dl ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3 ; ; dh = bytes (0, 1, 2, 3)
        Test_If_Not_Zero dl ; Compare DWORD remainders.
            ; If dl = 0, so if we have no dwords remainder, jmp over to compute the byte remainders
            mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4 | add dl 0-4 | jz @CheckBytes
            mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4 | add dl 0-4 | jz @CheckBytes
            mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4
        Test_End
@CheckBytes:
        ; check if we have no byte remainders
        Test_If_Not_Zero dh ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
            mov al B$esi | cmp al B$edi | jnz @NoMatch | dec dh | jz @Match
            mov al B$esi+1 | cmp al B$edi+1 | jnz @NoMatch | dec dh | jz @Match
            mov al B$esi+2 | cmp al B$edi+2 | jnz @NoMatch
        Test_End
    .Test_End
@Match:
    ; If the pattern matches, return the starting position.
    mov eax D@pDataBuffer
    ExitP

@NoMatch:
    ; No matches, return 0
    xor eax eax


EndP

Code Select

Proc ComparePatternBytes:
    Arguments @pDataBuffer, @pSearchPattern, @PatternSize
    Uses esi, edx

    ; Step 1: Check if the length is zero. If so, skip the function.
    mov edx D@PatternSize
    mov esi D@pDataBuffer
    mov edi D@pSearchPattern
    ; Split PatternSize into dwords (dl) and bytes (dh)
    mov dh dl       ; dh = PatternSize copy
    and dl 0-4      ; dl = our Dword computation (0, 8, 8, 12). So, values multiples of 4 only, on a maximum of 12
    xor dh dl       ; dh = our byte computation (0, 1, 2, 3). So, values are multples of 1 only, on a maximum of 3

    Test_If_Not_Zero dl ; Compare DWORD remainders. (up to 12 bytes)
        ; If dl = 0, so if we have no dwords remainder, jmp over to compute the byte remainders
        mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4 | add dl 0-4 | jz @CheckBytes
        mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4 | add dl 0-4 | jz @CheckBytes
        mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4
    Test_End

@CheckBytes:
    ; check if we have no byte remainders (0-3)
    Test_If_Not_Zero dh ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
        mov al B$esi | cmp al B$edi | jnz @NoMatch | dec dh | jz @Match
        mov al B$esi+1 | cmp al B$edi+1 | jnz @NoMatch | dec dh | jz @Match
        mov al B$esi+2 | cmp al B$edi+2 | jnz @NoMatch
    Test_End

@Match:
    ; If the pattern matches, return the starting position.
    mov eax D@pDataBuffer
    ExitP

@NoMatch:
    ; No matches, return 0
    xor eax eax

EndP

Code Select

Proc ScanSmallSSE2:
    Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
    Local @StringEnd, @FirstBitPos, @ScanLimit, @pFunction

    ; Early exit for tiny datasets (<= 8 bytes)
    ..If D@DataSize <= 8
        call CompareSmallDataBlock esi, D@DataSize, D@pSearchPattern, D@PatternSize
    ..Else
        ; Validate inputs
        mov esi D@pDataBuffer
        mov edi D@DataSize | lea eax D$esi+edi | mov D@StringEnd eax
        mov ebx D@PatternSize

        ; Calculate ScanLimit
        mov eax edi             ; eax = DataSize
        and eax 0-8             ; eax = DataSize & 0xFFFFFFF0
        lea eax D$esi+eax+1     ; eax = esi + (DataSize & 0xFFFFFFF0) + 1
        mov D@ScanLimit eax     ; Store result in ScanLimit

        ; Select comparison function based on pattern size. For situations where pattern size is betwen 16 and 31
        mov D@pFunction ComparePatternBytes ; Default for PatternSize <= 15
        If ebx >= 16
            mov D@pFunction ComparePattern16ByteBlocks  ; For PatternSize 16-31
        End_If

        ; Main loop: Process the dataset in blocks of 16 bytes. ebx = PatternSize. The register must not be changed
        .Do
            mov D@FirstBitPos 0
            movq XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
            movq XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.

            pcmpeqb xmm0 xmm2 ; Compare the first byte of the pattern with the first byte of the block.
            pcmpeqb xmm1 xmm3 ; Compare the last byte of the pattern with the last byte of the block.

            ; Combine the masks to find positions where both first and last bytes match.
            pand xmm1 xmm0      ; xmm1 = combined mask (Var_EqFirst AND Var_EqLast).
            ; Generate a bitmask from the combined mask.
            pmovmskb edx XMM1   ; ebx = bitmask of matching positions. Maximum value = 0FFFF

            ; Check if there are any matches.
            .Test_If_Not_Zero edx
                ; Process each matching position.
                Do
                    ; Find the first set bit in the bitmask.
                    GET_FIRST_BIT_SET edx | mov D@FirstBitPos eax
                    ; Calculate the address of the matching position.
                    lea edi D$esi+eax
                    ; Perform a detailed comparison of the pattern at this position.
                    call D@pFunction edi, D@pSearchPattern, D@PatternSize
                    cmp eax 0 | jnz L2>>    ; If a match is found, return the position.
                    ; Clear the leftmost set bit in the bitmask and save the result in dx.
                    CLEAR_LEFTMOST_SET16 dx
                Repeat_Until_Zero   ; perfom a loop if dx is not zero.
                ; Update the position to continue searching.
                mov eax D@FirstBitPos | add eax D@PatternSize
            .Test_Else
                ; If no matches, find the last set bit in the mask and advance the pointer.
                pmovmskb eax xmm0 ; eax = bitmask of first byte matches. Maximum value = 0FFFF
                Test_If_Not_Zero eax
                    ; Find the last set bit.
                    GET_LAST_BIT_SET eax
                    inc eax ; Advance to the next position.
                            ; We found at least 1 byte matching, increase the position of the last byte that matched by 1
                Test_Else
                    mov eax 8  ; No matches, advance by 8 bytes.
                Test_End
            .Test_End
            add esi eax ; Advance the dataset pointer.

            ; Check remaining size
            mov eax D@StringEnd | sub eax esi
            If eax < D@PatternSize
                xor eax eax | ExitP
            End_If

            lea eax D$esi+8
        .Loop_Until eax > D@ScanLimit

        ; Handle remaining tail
        mov eax D@StringEnd | sub eax esi
        If eax => D@PatternSize
            call CompareSmallDataBlock esi, eax, D@pSearchPattern, D@PatternSize
        Else
            xor eax eax
        End_If
    ..End_If
L2:

EndP

Code Select

Proc CompareSmallDataBlock:
    Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
    Local @EndString
    Uses esi, edi, edx, ecx

    ; Initialize pointers.
    mov esi D@pDataBuffer
    mov eax D@DataSize
    ; EndString = pDataBuffer + DataSize - PatternSize
    lea eax D$esi+eax | sub eax D@PatternSize | mov D@EndString eax

    mov edi D@pSearchPattern
    .Do
        ; Compare the pattern byte-by-byte.
        mov ecx D@PatternSize
        xor eax eax
        Do
            movzx edx B$esi+eax | cmp dl B$edi+eax | jne L1>
            inc eax
            dec ecx
        Repeat_Until_Zero
        ; If the pattern matches, return the starting position.
        mov eax esi
        ExitP

    L1:
        ; Move to the next position in the dataset.
        inc esi
    .Loop_Until esi > D@EndString

    ; If no match is found, return 0.
    xor eax eax

EndP

Macros used:

Code Select

[CLEAR_LEFTMOST_SET16 | mov ax #1 | add #1 0-1 | and #1 ax]
[GET_FIRST_BIT_SET | bsf eax #1]
[GET_LAST_BIT_SET | bsr eax #1]
[SSE2_EXPAND_BYTE | movd #1 #2 | punpcklbw #1 #1 | punpcklwd #1 #1 | pshufd #1 #1 0]

Masm translation (I hope it is ok, but if not, i included the disassembled source that works - just rename xmmword with oword or other compatible masm token for SSE registers)
New Files Atacched (24/03/2025) with comments (See RosAsmCode_Info.txt)

Bug fixes:

Fixed an error in one of the helper functions that was causing the pattern to not found properly.
Fixed security to guarantee that the data stays within the limits of a SSE2 register (16 bytes per block)
Improved speed for Match and Non-Match cases. Now both are faster
Added more helpers to handle patterns and data sizes smaller than 32 and 16 bytes to gain performance in small data chunks.
Added comments on all functions.

Note: For SSE2 with x86, i guess i reached my limit on speed the algorithm. In theory, it could be improved a little bit forcing the algo to use the others registers (xmm4 to xmm7) to also look for the 2nd and penultimate byte from the pattern, or even creating now a flag to handle when the pattern is only 2 bytes, so it won´t enter the core function since it already found the matching pair etc etc, but it will result on a more complex code and i´m not sureif the performance will vary that much.

In any case, if you want to test to see if it is possible to optimize it even further (obbeying the limits etc), you can see an alternative solution here at post #31 (https://masm32.com/board/index.php?topic=12630.30)

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 18, 2025, 01:50:20 AM

Hi guga!

Too many examples in the test. The entirety of the results was greater than the 20000 character limit per post, so I could not post it all.
Here is the Min/Max results:

Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 3.60 GHz

Code Select

_______________________________________________

Match Min Max Timmings:
_______________________________________________

Slowest Match cases

Clock cycles: 94 clocks
Example Case: 160

Fastest Match cases

Clock cycles: 38 clocks
Example Case: 3
_______________________________________________

Not Found Min Max Timmings:
_______________________________________________

Slowest Match cases

Clock cycles: 55 clocks
Example Case: 135

Fastest Match cases

Clock cycles: 19 clocks
Example Case: 48

Maybe use less number of examples, so all of the results can be posted?

Additionally...
If 'CharToFind' length is greater than the 'TestingString' length, the test should not even be run.
You should test the string lengths against each other before performing the tests, imo.
Then if the 'CharToFind' length is greater then the 'TestingString' length display error message?

Example:

Code Select

Clock cycles: 19 clocks

TestingString: abcdef1234567890pqrstuvabcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890
TestingString Size: 145
CharToFind: abcdef1234567890pqrstuvabcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890gugaxy789abcdefxyz789abcd1234567890abcdefgxyz789abcd1234567890
CharToFind Size: 207
Found at: Pos (0) - Not found

Obviously the 'CharToFind' will never be found, it is longer than the 'TestingString'. :smiley:

Also... 'CharToFind' maybe should be named 'StringToFind' instead?

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 18, 2025, 04:26:25 AM

Hi Zedd

Many Tks

I´ll try using less examples. I just created several of them with different lenght with odd, even and prime numbers sizes to see if the function will fail.

About the cases where the pattern is smaller than the Dataset, the function already checks this at the very beginning, i just included on the tests to see if it was not bypassing the security routines i settled for that. Those situations are checked twice inside the function.
1st - When the data input is smaller than the pattern and the function immediately exists
2nd - If the Dataset is bigger than the pattern, but on a specific point the remainder bytes to be scanned are smaller than the pattern. (Of course, after the previous bytes where correctly checked as well).

I´m optimizing it a bit more and suceedd to remove some uneeded checks in one of it´s internal functions. I´ll try one more thing and check if the results are being returned without any mismatch before converting it to masm and post it here for us. If this small fix was succeeded, i did gained something around 20% to 40% of speed

Code Select

_______________________________________________


Match Min Max Timmings:
_______________________________________________

Slowest Match cases

Clock cycles: 68 clocks
Example Case: 47

Fastest Match cases

Clock cycles: 42 clocks
Example Case: 2
_______________________________________________

Not Found Min Max Timmings:
_______________________________________________

Slowest Match cases

Clock cycles: 48 clocks
Example Case: 135

Fastest Match cases

Clock cycles: 18 clocks
Example Case: 48


Press enter to exit...

About changing the label name to StringToFind. Sure, it was just to display something on screen to identify what was being done. I´ll rename it on the next tests before convert this function to masm

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 18, 2025, 05:25:04 AM

Okay, guga. I just thought I would mention those, just in case that they might have been overlooked or missed...

I'll look for the next test when it becomes available.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 18, 2025, 10:14:29 AM

Ok, guys, new test. This time, on the print, i excluded on the display the cases where the mismatch is obvious (Pattern bigger than dataset). Of course, it didn´t affected the function at all...i just removed it from the print.

On this way we can check if the results are ok, since we have less data on the print to check. Also, i changed a bit the function and made it a bit faster. I hope this gain of speed is really due to performance and elimination of redundancies of the code and not due to mismatch on the result that could affect the speed.

If the results of this test are ok, i´ll port it to masm. Pls let me know so i can start porting it and inserting the proper comments on the function.

Btw...about the name of the app "TestingResultDeletar2a", well..it was just a dumy name for this test. I was about to delete it (that´s why i named it as "deletar" - portuguese for "to delete") if someting was wrong, but since i didn´t find wrong results so far, i kept this dummy name.

Btw..the function was created to respect the size limits of a x86 SSE2 register, so the goal was not to extrapolate it ever. On my tests, it seems to work as expected. If it really correct, then when i port it to masm, you can check for yourselfs too if any other error case do exists on larger datasets.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 18, 2025, 10:17:28 AM

Again the results exceed the 20000 chacter limit for posting it, in its entirety into a post. Full results attached in zip file
Min Max:

Code Select

Match Min Max Timmings:
_______________________________________________

Slowest Match cases

Clock cycles: 65 clocks
Example Case: 47

Fastest Match cases

Clock cycles: 39 clocks
Example Case: 23
_______________________________________________

Not Found Min Max Timmings:
_______________________________________________

Slowest Match cases

Clock cycles: 55 clocks
Example Case: 135

Fastest Match cases

Clock cycles: 31 clocks
Example Case: 19


Press enter to exit...

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 18, 2025, 10:50:09 AM

Hi Zedd

tks, but the result is empty. I guess you forgot to paste the result on the text file.

But..did you checked to see if we have any mismatch ? I mean if the function failed to find the proper patterns ?

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 18, 2025, 10:55:05 AM

Quote from: guga on March 18, 2025, 10:50:09 AMtks, but the result is empty. I guess you forgot to paste the result on the text file.

And I forgot to test the zip file. That was indeed my fault, sorry.

I have corrected that and reattached it above, where it was originally.

No, I have not inspected that the results were accurate that will take a little time... I'll have to get back to you in a little while.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 18, 2025, 10:59:23 AM

I think I found a couple errors in the position where string was found.
I'll make a report for each case where an error found....

It'll take some time.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 18, 2025, 11:08:27 AM

Tks, Btw...about the position...its´rather an index than the position. For example, on the 1st case

TestingString: 578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghdmerdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj
TestingString Size: 90
CharToFind: merdalgjg1ghdlgj8a8ahj1g1
CharToFind Size: 25
Found at: Pos (48) - merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj

The value of 48 means we have 48 bytes before the match. Therefore the pattern was located at the 49th byte. If needed i´ll redo the printing to reflect the exact position where the pattern was found.

"578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghd" = 48 bytes (Pos)

found it at the 49th byte: merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj

When i did the printf i simply calculated the returned address in eax and subtracted from the beginning of the data.

esi = Initial TestingString address

eax = Pattern address found

Pos = eax - esi

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 18, 2025, 11:12:34 AM

Okay, I will reexamine the results. I could have made a mistake too, when I looked the first time.

I am going to be away from computer for some time. It's beautiful outside, I keep going back in to the computer, and back outside...

Give me about an hour to finish checking each result properly. :smiley:
So I can enjoy the weather a little longer. :azn:

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 18, 2025, 11:39:07 AM

I came back inside early. I didn't want to make you wait too long... :smiley:

Okay. Indeed I did make an error when saying that "there was an error in the position where the string is found".
My mistake. The couple errors that I thought I found were the second or more occurrence of the string, not the first. The position listed is for the first occurrence, even if multiple occurrences are there.

I tested the first 40 Examples out of the 167 Examples tested. Almost 25%

The results of 38 of those are all correct as well as the positions of where that first occurence of each string is found, the 2 Examples where the string is not found the 'not found' result is also correct (Example 19 and Example 25).

I didn't go any further as I think I will not find any errors. It does indeed appear that there are no errors in the results. :thumbsup:

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 18, 2025, 03:22:03 PM

HI guys, i updated the 1st post and included the translation to masm of the last working version. I attached the file that i had to disassemble (as did before in other threads) which is more guaranteed to work (Just rename the xmmword to oword if needed). Additionally, i added on the post itself my attempt to port it to masm compatible syntax as well masm macros (But, i dont know if i translated correctly)

Btw, due to the limitations of characters lenght on the forum, i had to remove all the comments on the 1st post., but if needed here are the comments of the functions themselves

Code Select

;;
    SIMD_BinaryScan
        This function performs a binary search for a chunk of data (pSearchPattern) within a larger dataset (pDataBuffer).
        It uses SSE2 instructions for optimization and speed, especially with large datasets.
        The function can be used to find data of any size, including strings (case-sensitive) within a larger text.

    Parameters:
        pDataBuffer (in):       Pointer to the start of the larger dataset in which the search is performed.
                                This is the buffer in which you are searching.
        DataSize (in):          The size (in bytes) of the dataset. The size must be greater than or equal to 16 bytes.
        pSearchPattern (in):    Pointer to the start of the search pattern (the data chunk to search for within the dataset).
                                This is the smaller data chunk you are searching for.
        PatternSize (in):       The size (in bytes) of the search pattern.

    Return Value:
        Success: If the pattern is found, the function returns a pointer to the starting position of the matching chunk in the dataset.
        Failure: Returns 0 (&NULL) if the pattern is not found in the dataset.
;;

Code Select

;;
    CompareSmallDataBlock
        This function performs a byte-by-byte comparison of a small dataset (<= 16 bytes) with a search pattern.
        It is used as a fallback when the dataset is too small for SIMD optimizations.

    Parameters:
        pDataBuffer (in):       Pointer to the start of the dataset.
        DataSize (in):          The size (in bytes) of the dataset.
        pSearchPattern (in):    Pointer to the start of the search pattern.
        PatternSize (in):       The size (in bytes) of the search pattern.

    Return Value:
        Success: If the pattern is found, the function returns a pointer to the starting position of the match.
        Failure: Returns 0 (&NULL) if the pattern is not found.
;;

Code Select

;;
    VerifyPatternMatch
        This function performs a detailed comparison of a search pattern with a dataset block.
        It is used after a potential match is found in the main function to confirm the match.

    Parameters:
        pDataBuffer (in):       Pointer to the start of the dataset block.
        pSearchPattern (in):    Pointer to the start of the search pattern.
        PatternSize (in):       The size (in bytes) of the search pattern.

    Return Value:
        Success: If the pattern matches, the function returns a pointer to the starting position of the match.
        Failure: Returns 0 (&NULL) if the pattern does not match.
;;

And below is a example (I hope) how to call this function:

Code Select

; Define data and code sections
.data
    ; Define example data
    haystack db "This is a simple string to search within. While trying to find sinp from sinples string", 0
    needle db "search wi", 0

    ; Variables to store string lengths
    haystack_len dd ?
    needle_len dd ?

    ; Variable to store the search result
    result dd ?

.code
start:
    ; Calculate the length of the haystack string
    invoke lstrlen, addr haystack
    mov haystack_len, eax

    ; Calculate the length of the needle string
    invoke lstrlen, addr needle
    mov needle_len, eax

    ; Call the SIMD_BinaryScan function
    invoke SIMD_BinaryScan, addr haystack, haystack_len, addr needle, needle_len
    mov result, eax

    ; Check the result
    .if result == 0
        ; If the result is 0, the pattern was not found.
        print "Pattern not found.", 13, 10
    .else
        ; If the result is non-zero, the pattern was found.
        print "Pattern found at position: "
        print str$(result), 13, 10
    .endif

    ; Terminate the program
    invoke ExitProcess, 0

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 18, 2025, 09:25:06 PM

Hi guga,

I got the source code from the masm compatible .zip file to assemble okay. It will take a little time to get it ready to test it.

But....
But upon actually reading post #1 I see that you have a 'fixed' masm version, with proper names, etc. and should be ready to assemble. I did not read the entire post before downloading the .zip file. :tongue:

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 18, 2025, 10:03:06 PM

After copying each masm compatible procedure from post #1 and the macros there to a new file. I had to correct the starting line and ending line for each procedure.

After fixing that minor annoyance and adding some necessities, upon attempting to assemble the file:

Code Select

Microsoft (R) Macro Assembler Version 14.00.24210.0
Copyright (C) Microsoft Corporation.  All rights reserved.

 Assembling: C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm

***********
ASCII build
***********

C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(10) too large for specified size
C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(10) : C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(11) too large for specified size
C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(11) : error A¿C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(12) too large for specified size
C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(12) : fatal 掟睦w_
Assembly Error
Press any key to continue . . .

I don't think ml.exe likes your macros, guga.

The macros and masm compatible procedures from post #1 are in this source code, it would have been easier if they were all in the same code block in post #1:

Code Select

include \masm32\include\masm32rt.inc
.xmm

SIMD_BinaryScan        proto :DWORD, :DWORD, :DWORD, :DWORD
CompareSmallDataBlock  proto :DWORD, :DWORD, :DWORD, :DWORD
VerifyPatternMatch      proto :DWORD, :DWORD, :DWORD

; ---------------------------------------------------------------------------

if macro condition
    .if condition
endif

else macro
    .else
endif

elseif macro condition
    .elseif condition
endif

endif macro
    .endif
endif

; MASM Macros for loops
repeat macro
    .repeat
endm

until macro condition
    .until condition
endm

; MASM Macros for bit manipulation
bsf macro dest, src
    bsf dest, src
endm

bsr macro dest, src
    bsr dest, src
endm

; MASM Macros for SSE2 operations
SSE2_EXPAND_BYTE macro xmmreg, byteval
    movd xmmreg, byteval
    punpcklbw xmmreg, xmmreg
    punpcklwd xmmreg, xmmreg
    pshufd xmmreg, xmmreg, 0
endm

CLEAR_LEFTMOST_SET16 macro reg
    mov ax, reg      ; Copia o valor do registrador para AX.
    add reg, -1      ; Subtrai 1 do registrador (inverte todos os bits até o primeiro bit definido).
    and reg, ax      ; Aplica a máscara para limpar o bit mais à esquerda.
endm

GET_FIRST_BIT_SET macro reg
    bsf eax, reg      ; Encontra o índice do primeiro bit definido em `reg` e armazena em EAX.
endm

GET_LAST_BIT_SET macro reg
    bsr eax, reg      ; Encontra o índice do último bit definido em `reg` e armazena em EAX.
endm

; ---------------------------------------------------------------------------

.code

; ---------------------------------------------------------------------------

SIMD_BinaryScan proc pDataBuffer:DWORD, DataSize:DWORD, pSearchPattern:DWORD, PatternSize:DWORD
local FirstBitPos:DWORD, ScanLimit:DWORD, StringEnd:DWORD
local XmmPreserve[20]:BYTE
    xor eax, eax
    mov esi, [DataSize]
    if (esi < [PatternSize]) || ([PatternSize] == 0)
        mov eax, 0
        ret
    elseif ([DataSize] <= 16)
        invoke CompareSmallDataBlock, [pDataBuffer], [DataSize], [pSearchPattern], [PatternSize]
        ret
    endif
    lea eax, [XmmPreserve]
    movdqu [eax], xmm0
    movdqu [eax+16], xmm1
    movdqu [eax+32], xmm2
    movdqu [eax+48], xmm3
    mov esi, [pSearchPattern]
    mov ebx, [PatternSize]
    movzx eax, byte ptr [esi]
    SSE2_EXPAND_BYTE xmm2, eax
    lea eax, [esi+ebx-1]
    movzx eax, byte ptr [eax]
    SSE2_EXPAND_BYTE xmm3, eax
    mov esi, [pDataBuffer]
    mov edi, [DataSize]
    mov eax, [PatternSize]
    shr eax, 4
    add eax, 16
    sub edi, eax
    if edi > 16
        mov edi, [DataSize]
        sub edi, [PatternSize]
    endif
    lea eax, [esi+edi]
    mov [ScanLimit], eax
    mov eax, [DataSize]
    lea eax, [esi+eax]
    mov [StringEnd], eax
    .repeat
        mov [FirstBitPos], 0
        movups xmm0, [esi]
        mov eax, [PatternSize]
        movups xmm1, [esi+eax-1]
        pcmpeqb xmm0, xmm2
        pcmpeqb xmm1, xmm3
        pand xmm1, xmm0
        pmovmskb ebx, xmm1
        .if ebx != 0
            .repeat
                GET_FIRST_BIT_SET ebx
                mov [FirstBitPos], eax
                lea edi, [esi+eax]
                invoke VerifyPatternMatch, edi, [pSearchPattern], [PatternSize]
                .if eax != 0
                    jmp L2
                .endif
                CLEAR_LEFTMOST_SET16 bx
            .until bx == 0
            mov eax, [FirstBitPos]
            add eax, [PatternSize]
        .else
            pmovmskb eax, xmm0
            .if eax != 0
                GET_LAST_BIT_SET eax
                inc eax
            .else
                mov eax, 16
            .endif
        .endif
        add esi, eax
    .until esi > [ScanLimit]
    xor eax, eax
    .if [FirstBitPos] == 0
        mov eax, [StringEnd]
        sub eax, esi
        .if [PatternSize] <= eax
            invoke CompareSmallDataBlock, esi, eax, [pSearchPattern], [PatternSize]
        .else
            xor eax, eax
        .endif
    .endif
L2:
    lea esi, [XmmPreserve]
    movdqu xmm0, [esi]
    movdqu xmm1, [esi+16]
    movdqu xmm2, [esi+32]
    movdqu xmm3, [esi+48]
    ret
SIMD_BinaryScan endp

; ---------------------------------------------------------------------------

CompareSmallDataBlock proc pDataBuffer:DWORD, DataSize:DWORD, pSearchPattern:DWORD, PatternSize:DWORD
local EndString:DWORD
    mov esi, [pDataBuffer]
    mov eax, [DataSize]
    lea eax, [esi+eax]
    sub eax, [PatternSize]
    mov [EndString], eax
    mov edi, [pSearchPattern]
    .repeat
        mov ecx, [PatternSize]
        xor eax, eax
        .repeat
            movzx edx, byte ptr [esi+eax]
            cmp dl, byte ptr [edi+eax]
            jne L1
            inc eax
            dec ecx
        .until ecx == 0
        mov eax, esi
        ret
    L1:
        inc esi
    .until esi > [EndString]
    xor eax, eax
    ret
CompareSmallDataBlock endp

; ---------------------------------------------------------------------------

VerifyPatternMatch proc pDataBuffer:DWORD, pSearchPattern:DWORD, PatternSize:DWORD
local Reminder:DWORD
    mov edx, [PatternSize]
    mov esi, [pDataBuffer]
    mov edi, [pSearchPattern]
    and edx, -16
    mov eax, edx
    xor edx, [PatternSize]
    .if edx != 0
        lea esi, [esi+eax]
        lea edi, [edi+eax]
        mov dh, dl
        and dl, -4
        xor dh, dl
        .if dl != 0
            mov eax, [esi]
            cmp eax, [edi]
            jnz L4
            add esi, 4
            add edi, 4
            add dl, -4
            jz L2
            mov eax, [esi]
            cmp eax, [edi]
            jnz L4
            add esi, 4
            add edi, 4
            add dl, -4
            jz L2
            mov eax, [esi]
            cmp eax, [edi]
            jnz L4
            add esi, 4
            add edi, 4
        .endif
    L2:
        .if dh != 0
            mov eax, 0
            mov dl, byte ptr [esi]
            cmp dl, byte ptr [edi]
            jnz L4
            dec dh
            jz L2
            mov dl, byte ptr [esi+1]
            cmp dl, byte ptr [edi+1]
            jnz L4
            dec dh
            jz L2
            mov dl, byte ptr [esi+2]
            cmp dl, byte ptr [edi+2]
            jnz L4
        .endif
        mov eax, [pDataBuffer]
    .else
        mov eax, esi
    .endif
L4:
    ret
VerifyPatternMatch endp

start:

end start

I can go no further with this.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 19, 2025, 01:22:57 AM

Hi Larry

Tks a lot

Better would use the disassembled version, until someone port it to the default masm macros set. I gave a try porting it, but i´m not used to masm syntax since years, and had to translate the macros with ChatGPT (In fact, deepseek did the translation).

But, if you are used to masm macros system, it shouldn't be hard to port. RosAsm If macro set works the same as in Masm version the If_Or and If_And macros in RosAsm with maybe similar to If xxx= 5 & yy= 6 or If xxx= 10 OR yyy = 30 (If i remember, masm If macro set allowed something like that, right ? I mean, using multiple If variables on the same line with some boolean &, OR, right ?)

The while and end_while macro in RosAsm, is the same as while endwhile in masm, except it don´t accept multiple conditions.
The do loop_until macro in RosAsm is also similar to while macro in masm, except it don´t have the jmp opcode at the start. so, it works like:

Start:
mov eax 5
dec eax
cmp eax 0 | jne Start

or je, jle etc etc depending teh used RosAsm token, but the functionality is like that.

The uses macro in rosasm is just to preserve registers on a procedure. It´s a simple push reg at the start of a function and correspondant pop reg at the end (i guess masm has also something like that).

In RosAsm to make the source better to be read we can use separators "|" to dislay the code on a same line. THis is unused during compilation time, it´s just a token to make the code more readable, specially if the code have several lines, and we want it do display in one only. So

mov eax 5000 | shr eax 3 | inc eax

is the same as
mov eax 5000
shr eax 3
inc eax

The usage of "," in between the opcodes is optional. So you can also write it as:
mov eax, 5000
shr eax, 3
inc eax

Or any equivalent in Masm. - I personally, don´t use the comma´s too often for short opcodes. I use it more in call function with several parameters. But in RosAsm, it is optional, it won´t affect at all the code since it is handled by the assembler to simply skip those chars on certain conditions.

Now, if the function really work as expected, now it´s time to do the backwards version of it. So, a mirror function to works backwards searching the pattern from end to the beginning of the data chunk. I already did that, but since i updated the forward version, i´ll try to see if i can do the same with the backwards

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 19, 2025, 01:52:35 AM

I'm not very good with 'real' macros, just simple macros.
Using the disassembled version might assemble fine, and actually work, but it would require some modifications regarding all of the 'loc_004xxxxx' labels, etc. and a few other things to make it a proper source code. I don't have a lot of time to do all of that (I have projects of my own to work on), but yes it is possible to do.

Hopefully, someone else here can help you to make a proper Masm32 compatible version of these functions. It looks like a very promising and useful set of functions. :thumbsup:

Cheers...
Larry

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: ognil on March 19, 2025, 03:49:36 AM

Congratulations Guga for SIMD_BinaryScan (equivalent to Instring), :thumbsup:

It's great that you continue to have fun and feed the forum with interesting 32 bit code that our eminent translator zed151 tests and comments on. :badgrin:
I'm also grateful for the IDA translation into MasmCompatible.asm (unfortunately without any comments on the individual lines). :undecided:
For me, this is a perfect example of amateurish attitude towards the Instring function, because:
1. There is no check in the forum or by AI for an existing function written by a professional.
2. Splitting a simple function into three separate functions, etc. :smiley:

I'll stop here and suggest the following:
1. Take a look at the following elegant function

Code Select


;*******************************************************************;
; Immediate byte constants
;EQUAL_ANY	    = 0000b
;RANGES		    = 0100b
;EQUAL_EACH	    = 1000b
;EQUAL_ORDERED	    = 1100b
;NEGATIVE_POLARITY  = 010000b
;BYTE_MASK	    = 1000000b
;*******************************************************************************************************;
align 16
db 8 dup (90h)
InstrOg		proc  
		mov         rax,  rcx                   	; rcx = haystack, rdx = needle    
		movdqu	    xmm2, oword ptr[rdx]		; load 1st 16 bytes of needle
		pxor        xmm3, xmm3
@Loop:								; find the first possible match of 16-byte fragment in haystack       
		pcmpistri   xmm2, oword ptr[rax], EQUAL_ORDERED
		lea         rax, [rax+16]
		ja          @Loop
		lea         rax, [rax+rcx-16]			; save the possible match start
		mov         r9, rdx                      	; r9 = rdx rdx = needle
		mov         r8, rax
		jnc         @NotFound
		sub         r9, rax
@@:								; compare the strings
		movdqu      xmm1, oword ptr [r8 + r9]
		pcmpistrm   xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
		movdqu      xmm4, oword ptr[r8]
		pand        xmm4, xmm0
		pcmpistri   xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
		lea         r8, [r8+16]
		ja          @b
		jnc         @Found
		add         rax,1				 ; continue searching from the next byte
		jmp         @Loop
@NotFound:
		xor         eax, eax
@Found:
		ret
InStrOg		endp									
;*****************************************************************;

2. If you like it, just request a translation to MASM32 and testing from zed151.
I hope you continue to have fun and feed the forum with interesting code. :thumbsup:

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 19, 2025, 04:39:36 AM

Tks, but Couldn´t insert the comments on each line due to the limitations of the amount of characters on the forum. But, the comments on each line are displayed ion the RosAsm version, so it wouldn´t be hard to follow.

"For me, this is a perfect example of amateurish attitude towards the Instring function, because:
1. There is no check in the forum or by AI for an existing function written by a professional."

About the code...amatory attitude ? Not at all, there are papers describing the same method I used, scanning the data from 16 bytes to 16 bytes. Using 3 functions was mainly to avoid extrapolating the limits of a SSE register and also working on small datasets and small patterns. This function is not yet to be used as a instring on the normal way, since it does not look for null terminated byte, it is used to scan for a chunk of data (pattern) of any size on a dataset also of any size.

Also, it is for 32 bits and not 64 and so far, I´m limited to test with SSE2 mainly.

Your code, although shorter, is for 64 bits, using SSE4, but is limited to 16-byte aligned strings, needles bigger or equal to 16 bytes, null-terminated strings, cases where haystack is large enough, since it seems to not check the size of the pattern or the dataset itself.
(Unless, the code you posted is the core of another function that already check those limitations)

I would like to test it, but I can't port it to 32 bits and it is needed (for now) to be limited to SSE2.

I´ll try to port it to 32 bits using only SSE2 and if succeeded, I´ll post here the results

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 19, 2025, 05:14:42 AM

I gave a test trying to port it to 32 bits, but unsure if it is the proper equivalence. If the code below is the equivalent for your´s 64 bits version, then, it is not working:

Is this the proper porting to 32 bits using only SSE2 ?

Code Select

Proc InstrOg:
    Arguments @pHaystack, @pNeedle
    Uses esi, edi, ebx

    mov eax D@pHaystack            ; eax = haystack
    mov esi D@pNeedle
    movdqu xmm2 X$esi        ; load 1st 16 bytes of needle into xmm2
    pxor xmm3 xmm3                 ; xmm3 = 0 (used for comparisons)

@MainLoop:
    ; Compare 16 bytes of haystack with the needle
    movdqu xmm1 X$eax              ; load 16 bytes of haystack into xmm1
    pcmpeqb xmm1 xmm2              ; compare bytes for equality
    pmovmskb ecx xmm1              ; create a bitmask of the comparison results
    test ecx ecx                   ; check if any bytes matched
    jnz @PossibleMatch              ; if any match, jump to PossibleMatch

    add eax 16                     ; move to the next 16-byte block in haystack
    jmp @MainLoop                   ; continue searching

@PossibleMatch:
    ; A possible match was found. Now verify the full string.
    mov esi eax                    ; esi = start of the possible match in haystack
    mov edi D@pNeedle              ; edi = needle
    mov ebx 16                     ; ebx = number of bytes to compare

@VerifyLoop:
    movdqu xmm1 X$esi              ; load 16 bytes of haystack into xmm1
    movdqu xmm4 X$edi              ; load 16 bytes of needle into xmm4
    pcmpeqb xmm1 xmm4              ; compare bytes for equality
    pmovmskb ecx xmm1              ; create a bitmask of the comparison results
    cmp ecx 0FFFF                 ; check if all 16 bytes matched
    jne @NoMatch                    ; if not, jump to NoMatch

    add esi 16                     ; move to the next 16-byte block in haystack
    add edi 16                     ; move to the next 16-byte block in needle
    sub ebx 16                     ; decrement the remaining bytes to compare
    jnz @VerifyLoop                 ; if more bytes to compare, continue

    ; Full match found
    mov eax esi                    ; return the start of the match
    jmp @Done

@NoMatch:
    ; No match, continue searching from the next byte
    add eax 1                      ; move to the next byte in haystack
    jmp @MainLoop                   ; continue searching

@Done:
    ; Restore registers and return
    mov eax esi                    ; return the start of the match (or 0 if not found)

EndP

[TestingString2: B$ "578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghdmerdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0]
[CharToFind2: B$ "merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0]

call InstrOg TestingString2, CharToFind2

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: six_L on March 19, 2025, 08:33:29 PM

Hi,guga

His codes works fine.

Code Select

.data
	; Define example data
	haystack	db "578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghdmerdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0
	needle		db "merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0
.code
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
InStrOg proc plBuf:QWORD,plNed:QWORD
	
	mov         rax,  rcx                   	; rcx = haystack, rdx = needle   
	movdqu	    xmm2, oword ptr[rdx]		; load 1st 16 bytes of needle
	pxor        xmm3, xmm3
@Loop:							; find the first possible match of 16-byte fragment in haystack       
	pcmpistri   xmm2, oword ptr[rax], EQUAL_ORDERED
	lea         rax, [rax+16]
	ja          @Loop
	lea         rax, [rax+rcx-16]			; save the possible match start
	mov         r9, rdx                      	; r9 = rdx rdx = needle
	mov         r8, rax
	jnc         @NotFound
	sub         r9, rax
@@:							; compare the strings
	movdqu      xmm1, oword ptr [r8 + r9]
	pcmpistrm   xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
	movdqu      xmm4, oword ptr[r8]
	pand        xmm4, xmm0
	pcmpistri   xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
	lea         r8, [r8+16]
	ja          @b
	jnc         @Found
	add         rax,1				; continue searching from the next byte
	jmp         @Loop
@NotFound:
	xor         rax, rax
@Found:
	ret

InStrOg endp	
...
	invoke InStrOg, addr haystack, addr needle
	mov result, rax

	; Check the result
	.if result == 0
		; If the result is 0, the pattern was not found.
		invoke	AddLog,CStr("Pattern not found.",13,10)
	.else
		; If the result is non-zero, the pattern was found.
		invoke  RtlZeroMemory,addr szBuf,sizeof szBuf
		mov	rax,result
		lea	rcx,haystack
		sub	rax,rcx
		invoke	wsprintf,addr szBuf,CStr("Pattern found at position(offset): %d ",13,10),rax
		invoke	AddLog,addr szBuf
	.endif

QuotePattern found at position(offset): 48

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 20, 2025, 12:20:55 AM

Hi Six_L. Tks for testing

So, it was how i translate it to 32 bits that isn't working. I cannot test 64 bits apps in RosAsm right now. If someone could port this to 32 bits and with SSE2 it could be better for test. But, his version limits the size of strings and patterns or extrapolates the boundaries of a SSE register ? Those are the issues i would like to check.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: ognil on March 20, 2025, 02:36:26 AM

QuoteIf someone could port this to 32 bits and with SSE2 it could be better for test.

And wouldn't it be better if you changed it to:
If someone could port my code to 64 bits it could be better for test. Аnyway... :thumbsup:

Thanks six_L for your translation, but I don't understand why it includes 2 unnecessary variables that are not used anywhere in the code. I mean plBuf:QWORD and plNed:QWORD. :smiley:

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: six_L on March 20, 2025, 03:33:29 AM

Hi,ognil

QuoteI don't understand why it includes 2 unnecessary variables that are not used anywhere in the code. I mean plBuf:QWORD and plNed:QWORD.

I like the macros. Using the macros makes it easier to grasp the correctness of code from a macroscopic angles. if i have not used the "invoke" macros. then the code change the following:

Code Select

    invoke InStrOg, addr haystack, addr needle
    mov result, rax

==>

Code Select

    
    mov    rcx,offset haystack
    mov    rdx,offset needle        
    call     InStrOg
    mov     result, rax

Regards
six_L

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: ognil on March 20, 2025, 05:40:18 AM

Thanks to six_L for trying to explain the presence of two unused variables in your code.

Unfortunately, this is wrong, because there is nothing different in the result of the two calls you showed, from the debugger's perspective:

Code Select

invoke InStrOg,addr szText,addr szText31 generate in debugger MS VC:
000000014000113D 48 8D 0D 6C 8D 06 00 lea   rcx,[szText (0140069EB0h)]  
0000000140001144 48 8D 15 74 8D 06 00 lea   rdx,[szText31 (0140069EBFh)]  
000000014000114B E8 3D 82 00 00       call  InStrOg (014000938Dh)

In other words, the macro does not write the variables plBuf:QWORD and plNed:QWORD into the code!!.
This is done by the programmer additionally, without being necessary in this case.
This is one of the reasons why I do not use macros, because then a mandatory check with the debugger must be done.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: TimoVJL on March 20, 2025, 06:29:05 AM

/Fl⟦filename⟧ Generates an assembled code listing. See /Sf.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: _japheth on March 20, 2025, 12:45:32 PM

Quote from: ognil on March 20, 2025, 05:40:18 AM2.
Quotemov rcx, offset haystack ; mov rdx, offset needle
Тhis may or may not work depending on how big the variable address is in memory.

Isn't this a questionable statement? "MOV reg64,imm" is actually one of the few instructions that allow 64-bit immediates as source operand (so "mov rdx,offset xxx" is supposed to work for ANY 64-bit address). OTOH, "LEA reg64, [mem64]" has a displacement operand of 32-bit only, so it's restricted to either PC +/- 31-bit offset or, for non-PC-relative addressing, to address the first 4 GB only.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: six_L on March 20, 2025, 04:15:40 PM

Hi,ognil

1,

QuoteIn other words, the macro does not write the variables plBuf:QWORD and plNed:QWORD into the code!!.
This is done by the programmer additionally, without being necessary in this case.

your "InStrOg" PROC have two parameters[in]. i added the variables plBuf:QWORD and plNed:QWORD into the code for receiving the inputted parameters.
"addr haystack" --> "plBuf" --> "rcx"
"addr needle" --> "plNed" --> "rdx"

Another way(remove the variables plBuf:QWORD and plNed:QWORD):

Code Select

InStrOg proc 
    
    mov         rax,  rcx                       ; rcx = haystack, rdx = needle   
    movdqu        xmm2, oword ptr[rdx]        ; load 1st 16 bytes of needle
    pxor        xmm3, xmm3
@Loop:                            ; find the first possible match of 16-byte fragment in haystack       
    pcmpistri   xmm2, oword ptr[rax], EQUAL_ORDERED
    lea         rax, [rax+16]
    ja          @Loop
    lea         rax, [rax+rcx-16]            ; save the possible match start
    mov         r9, rdx                          ; r9 = rdx rdx = needle
    mov         r8, rax
    jnc         @NotFound
    sub         r9, rax
@@:                            ; compare the strings
    movdqu      xmm1, oword ptr [r8 + r9]
    pcmpistrm   xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
    movdqu      xmm4, oword ptr[r8]
    pand        xmm4, xmm0
    pcmpistri   xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
    lea         r8, [r8+16]
    ja          @b
    jnc         @Found
    add         rax,1                ; continue searching from the next byte
    jmp         @Loop
@NotFound:
    xor         rax, rax
@Found:
    ret

InStrOg endp    
...
    ; Call the SIMD_BinaryScan function
    lea    rcx,haystack
    lea    rdx,needle
    invoke    InStrOg
    mov    result, rax

    ; Check the result
    .if result == 0
        ; If the result is 0, the pattern was not found.
        invoke    AddLog,CStr("Pattern not found.",13,10)
    .else
        ; If the result is non-zero, the pattern was found.
        invoke  RtlZeroMemory,addr szBuf,sizeof szBuf
        mov    rax,result
        lea    rcx,haystack
        sub    rax,rcx
        invoke    wsprintf,addr szBuf,CStr("Pattern found at position(offset): %d ",13,10),rax
        invoke    AddLog,addr szBuf
    .endif

2,
in UASM64
"lea rcx,haystack" == "mov rcx, offset haystack"

Regards
six_L

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: ognil on March 21, 2025, 02:09:19 AM

Thanks guys, :smiley:

I don't want to argue with you because this thread is basically cleared of comments and we should respect the work of others. When you post your own competing code against Guga's code, I would be happy to comment on it in detail. :thumbsup:

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 24, 2025, 03:44:42 AM

New version coming soon. Succeeded to make no errors so far, and kept the speed for both (match and non match cases). Also succeeded (finally) to force the data to always stays within the limits of a SSE registers. On this way we have 0 chance to an error when trying to load data beyond the limits.. Ex:
[DataSet: B$ "Hello world, i´m trying to find a string", 0] ; last byte is "g" and not 0
[Pattern: B$ "string", 0] ; 1st and last chars are s and g, respectively

On this, way, if the loop happens to be in "ind a string" - < pointer in esi is here. Then, when it try to load in xmm0, since this chunk has only 12 bytes long, it cannot load the extra 13 to 16 bytes. (We are not loading only strings, and we have no guarantees that we still have enough memory after the ending "g" to be loaded, since the data can be located at the very end of a block of memory or physical data (ex: we are trying to locate a chunk of bytes on the end of the .data section of a PE or other type of file and the memory ends immediately after the "g")

I succeeded to get no overheads on the algorithm. Now, i´m trying to improve it a little bit, tweaking how it can handle now chunks of data smaller than 31 bytes long, since those are related both as the remainder bytes of a long chain of data and also as small datasets to be scanned.

So, far, in both i´ve got the worst speed of around 90-96 clock cycles on the 167 types of data used to scan and around 40-43 clock cycles the fastest on my AMD.

Slowest Match cases

Clock cycles: 96 clocks
Example Case: 35

Fastest Match cases

Clock cycles: 43 clocks
Example Case: 21
_______________________________________________

Not Found Min Max Timmings:
_______________________________________________

Slowest Match cases

Clock cycles: 73 clocks
Example Case: 19

Fastest Match cases

Clock cycles: 32 clocks
Example Case: 83

Press enter to exit...

I´ll see if the approach i´m planning to handle those data smaller than 32 bytes will have any impact on the performance of the algorithm. I´ll try creating a hybrid algorithm that can handle only this small blocks of data using SSE2 registers and normal x86 registers (for even smaller chunks) to see if it is worthful than using the currently function as (Which is a bit slow, btw):

Code Select

Proc CompareSmallDataBlock:
    Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
    Local @EndString
    Uses esi, edi, edx, ecx

    ; Initialize pointers.
    mov esi D@pDataBuffer
    mov eax D@DataSize
    lea eax D$esi+eax | sub eax D@PatternSize | mov D@EndString eax

    mov edi D@pSearchPattern
    .Do
        ; Compare the pattern byte-by-byte.
        mov ecx D@PatternSize
        xor eax eax
        Do
            movzx edx B$esi+eax | cmp dl B$edi+eax | jne L1>
            inc eax
            dec ecx
        Repeat_Until_Zero
        ; If the pattern matches, return the starting position.
        mov eax esi
        ExitP

    L1:
        ; Move to the next position in the dataset.
        inc esi
    .Loop_Until esi > D@EndString

    ; If no match is found, return 0.
    xor eax eax

EndP

Note: I tested the suggested routines described in several papers (Link1 (https://github.com/WojciechMula/sse4-strstr/blob/master/avx2-strstr-v2.cpp#L107) , link2 (http://0x80.pl/notesen/2016-11-28-simd-strfind.html#generic-sse-avx2) etc etc), but all of them failed in terms of performance, speed and reliability, since they all extrapolated the boundaries of SSE2 registers. Also in some of those papers they suggested that we could ignore the comparison of the 1st and last char before it pass to the core compare function, but this is useless. No actual gain of speed whatsoever. Forcing it to recognize the size of the dataset to just advance 1 bytes and reduce the size in 2, so far caused no impact. IN fact, it leaded the algo to be a bit slower in 3 or 4 clock cycles.
The technique described in those papers are similar to the one i created, except to the fact that my algo respects the limits for a SSE2 register and also is automatically adjustable to any size of data and any size of pattern. The problem with many of those algorithms i tested so far (the one in github as well), is that when it enter on the loop to compare the possible matches, it does not calculate only the matching pairs. Instead, it perform more loops than necessary, and also don´t check the true limit of the dataset , which is not only the cause of the limit extrapolation but also make the functions to be slower because it continues to loop until it reaches the true end of the data, rather than estimate the maximum limit to be performed. For, instance, if the pattern is 10 bytes long, and after all the loops, the ending pointer still needs to check for some remainders. The algorithm must terminate if the rest of the data to be scanned is smaller than the pattern. I.e: say we reach the limit for the remainders, and there´s still 5 bytes to the real end of the dataset. If the pointer is located somewhere between the 10th byte of true end of the data and the limit, then there´s no room to scan, because we cannot have any pattern beyond that. Ex: esi ends in 10 but the true data ending is located in 17. The pattern size = 10 bytes, but, if we are at the end of the limit, the amount of bytes to the end is only 7 bytes, so smaller then the patternsize itself.

Attached the new benchmark version to test.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 24, 2025, 05:08:06 PM

Hi Guys

i updated the algorithm on the 1st post. Also inserted comments and the correspondent masm version to be tested (just replace the xmmword with oword - or equivalent for SSE2 in masm).

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 25, 2025, 09:26:39 AM

As reported on the 1st post Link (https://masm32.com/board/index.php?topic=12630.0), if anyone else wants to give a try on this algo, or improve it for SSE2 obeying the limits of a xmm register (if possible), here are a couple of things that may be done:

1 - Create a flag before the loop for the cases where the pattern is only 2 bytes. True = Match Pairs ok. False = Not yet, the pattern is bigger than 2 bytes. This flag can be accessed right after the .Test_If_Not_Zero edx, something like.:

(...)
.Test_If_Not_Zero edx
Do
; Find the first set bit in the bitmask.
GET_FIRST_BIT_SET edx | mov D@FirstBitPos eax
cmp D@PatternSize 2 | jz L2>> So, since the pattern is only 2 bytes and we already found them from the SIMD computation, there´s no need to go further onto the call to the helper functions.

2 - make usage of xmm4 and xmm5 to store the 2nd and penultimate byte from the pattern and extend them to the correspondent SSE registers. Then on the SImD comparison , use registers xmm6 and xmm7 to store them and also be compared as we did for xmm0 and xmm1. On this way, we will compare at once, 4 bytes from the pattern instead of 2. And then simply add another pand between the resultant pairs and create the mask on edx as we did in pmovmskb edx XMM1. The problem is that, if you use the 2nd and penultimate byte, you also needs to adjust the pointer to esi to they points to the proper places, something like:

movq XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
movq XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
movq XMM6 X$esi+1 ; Load the 2nd 16 bytes of the current block into xmm0. Var_BlockFirst.
movq XMM7 X$esi+ebx-2 ; Load the penultimate 16 bytes of the current block into xmm1. Var_BlockLast.

The main problem will relies in obeying the boundaries of the registers to avoid it tries to point to addresses outside the limits of the data. So, you most likely will need to use another x86 register to adjust those pointers dynamically. Something like:

movq XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
movq XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
mov eax esi | add eax D@SecondByteFlag
movq XMM6 X$eax ; Load the 2nd 16 bytes of the current block into xmm6. Var_BlocSecond.
mov eax esi | sub eax D@PenultimateByteFlag
movq XMM7 X$eax+ebx ; Load the penultimate 16 bytes of the current block into xmm7. Var_BlockPenultimate.

Here on this example, those flags represents the amount of bytes to advance or subtract according to the size of the pattern. This must be done before the loop as well. Something like:

If pattern = 2 bytes, D@SecondByteFlag = 0, D@PenultimateByteFlag = 1 ; always starts with 1 subtraction
If pattern = 3 bytes, D@SecondByteFlag = 1, D@PenultimateByteFlag = 1 ; Pattern is odd, so we need only to point the last byte to itself (-1 as above)
If pattern >= 4 bytes, D@SecondByteFlag = 1, D@PenultimateByteFlag = 2 ; Pattern is bigger or equal to4 bytes (odd or even, it won´t matter), so we need only to compare the 2 pairs of the start of the pattern and 2 pairs at the end, regardless the size of the pattern.

In theory, this could work and force the algorithm to advance 16 bytes more often making it be a bit faster, since the chances of 4 matching pairs at those same positions are low in real life applications, in most of the cases (Specially if the data set is composed of random bytes).
The only problem i see in adding those 2 more techniques is if that will really increase more the speed or we just added overheads to the algo, making it slower.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: zedd on March 26, 2025, 03:23:44 PM

Results from running executable in "BinScan24Mar2025.zip" attached as a .txt file (not zipped)

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on March 26, 2025, 03:50:09 PM

Tks a lot, Zedd :thumbsup: :thumbsup: :thumbsup:

It seems to be working as expected now. Faster and handling all the data. I´m currently working on the reversed function now. I thought it was easier to develop the exact mirror of SIMD_BinaryScan, since i succeeded to fix the original version, but it´s a bit hard to make it stays in the limits when working backwards. I´m trying to see why this is happening, and when succeed, i´ll post here also the backwards function.

Those functions to properly scan data (forward or backward) are the ones i need to fix soon because i´m trying to use them not only inside RosAsm, but making them in general usage as well. Im currently trying to fix several old functions in RosAsm to release a newer version (If i succeed to make the functions works more independently from each other and create the proper dlls for that). I need to fix RosAsm and make it easier to handle the disassembler, debugger, decoder and encoder and also create the proper internal tables to handle the necessary data. On this way, it could be easier to someday - eventually - i decide to create a 64bit version for RosAsm as well. Personally, i´m not a big fan of 64 bits apps, but they are in fact needed nowadays, specially when developing plugins or small apps to work with image, audio and video processing. But, making a 64 bit version of RosAsm is impossible to port from the current Source code - That´s why i´m trying to fix it.

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on April 02, 2025, 12:44:27 AM

Hi guys, the reverse function is almost finished. I succeeded to make the reverse function (Scan from end to start) with similar speed as the normal version. I found only 1 mistake on my recent tests and it seems to happens on aligned strings (aligned by 16 bytes). I´m pretty sure it is missing only 1 byte somewhere, i´ll try to fix it before the next release. Also, i updated the code again on the normal function. (I focused more on avoid the extrapolation, but the speed is the same as the last one - or faster in some cases). Once i finishes i´ll post it here.

Next steps of those algorithms i´ll focus on strings:
1 - Create the correspondent search for strings on case insensitive mode, containing also special Ascii chars (latin origins, such as portuguese, french, german, italian, spanish, swiss and some nordic/scandinavian languages: danish, norwegian, swedish)
2 - Create a routine to skip some user defined chars
3 - Create a routine to check for the boundaries of a string (word edge). In fact, the word edge is already implemented, but i need to adapt the updated version i made for RosAsm onto this new string search routines
4 - Unicode version (UTF16), but the simpler one...The one that uses only the char followed by 0, such as in Unicode C, Pascal style, Delphi style etc (Sorry, no plans to work so soon on chinese, japanese, russian or other fancy unicode chars)

Note: For fancy languages ( russian, chinese, japanese etc) that uses UTF16 or even UTF32 etc, the normal algorithms (SIMD_BinaryScan and it´s reversal), should work already, but it won´t work for case insensitive yet, because i have no idea if (or how) such languages deals with case insensitive strings)

References for the Unicode versions to analyze later:
What is unicode ? (https://www.ime.usp.br/~pf/algorithms/appendices/unicode.html)
The ISO latin table (https://www.ime.usp.br/~pf/algorithms/appendices/iso-latin-1.html)
UTF8 encoding (https://www.utf8-chartable.de/unicode-utf8-table.pl)
UTF16 (https://en.wikipedia.org/wiki/UTF-16)

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on April 02, 2025, 12:03:32 PM

Finished the backwards algo. I´ll clean-up the code and put the comments on it and post here both (The normal version updated and this one). I tested both and they seems to work without errors (as far i saw). I loaded a big text (Around 6.5 Mb) from Project Guttenberg called: "The Project Gutenberg EBook of The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (#15 in our series by Sir Arthur Conan Doyle)" and starting scanning it. The test consists in loading the text in memory and creating patterns starting from 1 Byte that are incremented on each loop. So, on the 1st iteration, it take 38 bytes from the start, used it as a pattern and scan it on the big text. On the next iteration, it adds 1 byte to the older pos and then search 38 bytes as a pattern on the big text and it continues incrementing and scanning until it reaches the end of the 6.5 Mb file. I only limited the size of the pattern to a maximum of 38 bytes in order to see it the algo will crash even considering that the pattern position is being incremented at each loop, and therefore we could reach some unaligned and aligned text to perform the scans, depending of position of the newly created pattern on each loop. Fortunately, it always found the patterns created and never bypassed the limits of SSE registers in any of it´s internal functions.

So it processed something around 6.5Mb * 6.5Mb (42.5 Tb) of information. On my tests it took a bit more than one hour to perform the whole thing. So, not counting the limitation of my CPU, the speed of this thing (in theory) can reach something around 11.74 Gb/sec.

The tests i did was like this:

Code Select

[MyTxt: D$ 0] ; Pointer to the contents of Big.txt
[MyTxtSize: D$ 0] ; Size of Big.txt
[GugaDymmyCnt: D$ 0]

    call 'FastCRT.ReadOpenedFile' &NULL, {B$ "big.txt", 0}, MyTxt, MyTxtSize, 0
    mov eax D$MyTxtSize | mov D$GugaDymmyCnt eax
    mov edi D$MyTxt
    Do
        call SIMD_BinaryScan_Back D$MyTxt, D$MyTxtSize, edi, 38
        If eax = 0
            mov eax eax ; If debugger paused here, it means we found an error. Fortunately, it never reached this point, meaning it always found the patterns it was searching for :) 
        End_If
        inc edi
        dec D$GugaDymmyCnt ; dummy counter to estimate when this thing will stop.
        lea eax D$edi+38
    Loop_Until B$eax = 0
    call 'KERNEL32.ExitProcess' 0

Once i finish the comments i´ll upload it here.

One question. The reversed functions uses as an input the starting address of the text (The same as in the regular version), and internally it points it to the end to it starts scanning backwards.I made this only to maintain some correspondence with the normal version. But, i´m considering adding a flag where the user can input to start the scanning from the end of the string to the start.

For example:

Code Select

[MyTxt: B$ "I say, Berg, my dear fellow," said Rostov, "when you get a letter
from home and meet one of your own people whom you want to talk
everything over with, and I happen to be there, I'll go at once, to be
out of your way! Do go somewhere, anywhere... to the devil!" he
exclaimed, and immediately seizing him by the shoulder and looking
amiably into his face, evidently wishing to soften the rudeness of his
words, he added, "Don't be hurt, my dear fellow; you know I speak from
my heart as to an old acquaintance.", 0]

[MyTxtSize: D$ 509]

[MyPattern: B$ "wishing", 0]
[MYpatternSize: D$ 7]

[SCAN_FROM_END 0]
[SCAN_FROM_START 1]

Code Select

call SIMD_BinaryScan_Back MyTxt, D$MyTxtSize, MyPattern, D$MYpatternSize, SCAN_FROM_START

It will scan the string "wishing" from the start "I say,..." to the end

Or

Code Select

call SIMD_BinaryScan_Back MyTxt, D$MyTxtSize, MyPattern, D$MYpatternSize, SCAN_FROM_END
It will scan the string "wishing" from the end of the text " an old acquaintance." to the start.

Later i´ll rename those flags, this was just a idea. Does a flag to do this is better, or i keep the algo without it scanning as default always from start to the end ?

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on April 04, 2025, 03:20:13 PM

Hi Guys

Finished the implementation of the new version that works with Strings (case insensitive). So far,i´m working with Basic Latin (A to Z, a to z) so i can check the speed. It´s a bit slower than i expected, because i needed to create another helper function to store small chunks (less than 16 bytes) to a variable (In fact the function is called twice, since it needs to copy both the chunk related to the dataset and the pother related to the pattern.

Can someone test to see if the results are ok ? I mean, check if it really found the strings or if it failed to find the pattern (when it was existent in the dataset) ?

Also..is there a faster way to copy a maximum of 16 bytes to a variable (unfortunately, they will need to be on their own functions as well) ? I was considering using repe movsb etc, but i think they still can be a bit slow, right ?

Src is embedded in the app, btw. Later, if it really is working as expected, i´ll post here the source and the translation to masm (i´ll put the other versions as well - the backwards and forwards updated i previously did)

If everything is ok, and if i suceed to gain a bit more of speed on this, i´ll create another function adapted to work with all Latin Strings (german, portuguese, french, italian, spanish, and some Norwegian chars too) and later, will do the same for the backwards version of those things.

The next step is adapting both functions to handle word edges (i already made it for rosasm, but i´ll need to adapt to those functions), and also a routine to skip certain chars

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on April 04, 2025, 06:17:32 PM

Found it how to not ruin the performance. made a simple version for copying small chunks of data

Code Select


Proc Memcpy_SmallFast:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx

    mov edi D@pDest       ; Destination
    mov esi D@pSource     ; Source
    mov ecx D@Length      ; Size (max 16)

    If ecx >= 8
        movsd | movsd | sub ecx 8 | jz @DoneCopy
    End_If

    If ecx >= 4
        movsd | sub ecx 4 | jz @DoneCopy
    End_If
    rep movsb
@DoneCopy:

EndP

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: jj2007 on April 04, 2025, 06:43:18 PM

See Faster Memcopy (https://masm32.com/board/index.php?topic=4067.0) in the Lab

Title: Re: SIMD_BinaryScan (Instring equivalent)
Post by: guga on April 05, 2025, 06:05:08 PM

Tks, JJ

i had forget about that routine. I tested it, but, since it is for only for a maximum 16 bytes, the normal x86 (Memcpy_SmallFast) works better inside the context of the whole function.

Text only | Text with Images

SMF 2.1.6 © 2025, Simple Machines