Hi Guys
Updated (24/03/2025):
Sometime ago i started creating a fast Instring function to be used as described at here (https://masm32.com/board/index.php?topic=11088.msg133567#msg133567)
The function is a pattern compare that later i´ll use it as a InString (IN fact, it can be used already as such). Recently i found small issue in a test i was doing where it mismatched one of the patterns, so one pattern was not found on a string, but it was there.
Can someone help me to show your results and also if one of the cases tested failed to identify the pattern ?
Updated Post:RosAsm codeProc SIMD_BinaryScan:
Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
Local @FirstBitPos, @ScanLimit, @StringEnd, @pFunction
Structure @XmmPreserve 80, @XMMReg0Dis 0, @XMMReg1Dis 16, @XMMReg2Dis 32, @XMMReg3Dis 48, @XMMReg4Dis 64
Uses esi, ebx, edi, edx
; Preserve XMM registers efficiently
movdqu X@XMMReg0Dis XMM0
movdqu X@XMMReg1Dis XMM1
movdqu X@XMMReg2Dis XMM2
movdqu X@XMMReg3Dis XMM3
mov esi D@DataSize
; edi = DataSize
; edx = Bytes for ScanLimit
; ebx = PatternSize
; Security: Pattern size cannot be bigger than data size, and it cannot be zero. Also we precalculate ScanLimit Bytes
...If_Or esi < D@PatternSize, D@PatternSize = 0
mov eax 0-1 ; Return -1 for invalid input (PatternSize > DataSize or 0)
...Else
; Initialize pattern and buffer pointers
mov eax D@pDataBuffer
mov edi D@DataSize | lea eax D$eax+edi | mov D@StringEnd eax ; StringEnd = pDataBuffer + DataSize
mov esi D@pSearchPattern
mov ebx D@PatternSize
; Precompute pattern bytes for SIMD comparison
movzx eax B$esi | SSE2_EXPAND_BYTE XMM2 eax ; First byte in XMM2
lea eax D$esi+ebx-1 | movzx eax B$eax | SSE2_EXPAND_BYTE XMM3 eax ; Last byte in XMM3
; Select verification function based on PatternSize
mov D@pFunction ComparePattern16ByteBlocks ; Default for PatternSize > 16
; Check if PatternSize >= 16 (shift by 4 = divide by 16)
; If >= 16, keep ComparePattern16ByteBlocks
; Otherwise, use byte/dword scalar comparison
mov eax ebx | shr eax 4 | jnz L1> | mov D@pFunction ComparePatternBytes | L1:
mov esi D@pDataBuffer
; Handle small datasets (< 32 bytes)
; edi = DataSize / 32
; If >= 32, proceed to main loop
mov eax edi | shr edi 5 | jnz L1> ; (DataSize/32)
; eax = DataSize - PatternSize
; Check if remaining size fits 16-byte SSE register
; if data is smaller then 32 bytes we need to see if the pattern fits the SSe register size limit (16 bytes)
sub eax ebx | shr eax 4 | jnz L1> ; (DataSize-PatternSize)/16 . If >= 16, use main loop
call ScanSmallSSE2 esi, D@DataSize, D@pSearchPattern, D@PatternSize
jmp L2>>
L1:
; Compute ScanLimit for main loop (aligns to 16-byte boundary)
; Calculate ScanLimit
mov eax D@DataSize ; eax = DataSize
and eax 0-16 ; eax = DataSize & 0xFFFFFFF0 . Align DataSize down to multiple of 16
lea eax D$esi+eax+1 ; eax = esi + (DataSize & 0xFFFFFFF0) + 1 . ScanLimit = pDataBuffer + aligned DataSize + 1
mov D@ScanLimit eax ; Store result in ScanLimit
; Main loop: Process the dataset in blocks of 16 bytes. ebx = PatternSize. The register must not be changed
.Do
mov D@FirstBitPos 0
movdqu XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
movdqu XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
pcmpeqb xmm0 xmm2 ; Compare the first byte of the pattern with the first byte of the block.
pcmpeqb xmm1 xmm3 ; Compare the last byte of the pattern with the last byte of the block.
; Combine the masks to find positions where both first and last bytes match.
pand xmm1 xmm0 ; xmm1 = combined mask (Var_EqFirst AND Var_EqLast).
; Generate a bitmask from the combined mask.
pmovmskb edx XMM1 ; ebx = bitmask of matching positions. Maximum value = 0FFFF
; Check if there are any matches.
.Test_If_Not_Zero edx
; Process each matching position.
Do
; Find the first set bit in the bitmask.
GET_FIRST_BIT_SET edx | mov D@FirstBitPos eax
; Calculate the address of the matching position.
lea edi D$esi+eax ; edi = potential match address
; Perform a detailed comparison of the pattern at this position.
call D@pFunction edi, D@pSearchPattern, D@PatternSize
cmp eax 0 | jnz L2>> ; If a match is found, return the position.
; Clear the leftmost set bit in the bitmask and save the result in dx.
CLEAR_LEFTMOST_SET16 dx ; Clear processed bit
Repeat_Until_Zero ; perfom a loop if dx is not zero.
; Update the position to continue searching.
mov eax D@FirstBitPos | add eax D@PatternSize ; Advance past current pattern
.Test_Else
; Check first byte matches to optimize step
; If no matches, find the last set bit in the mask and advance the pointer.
pmovmskb eax xmm0 ; eax = bitmask of first byte matches. Maximum value = 0FFFF
Test_If_Not_Zero eax
; Find the last set bit.
GET_LAST_BIT_SET eax ; Find last first-byte match
inc eax ; Advance to the next position.
; We found at least 1 byte matching, increase the position of the last byte that matched by 1
Test_Else
mov eax 16 ; No matches, step full 16 bytes
Test_End
.Test_End
add esi eax ; Advance buffer pointer
; Check if remaining size is too small for pattern
mov edi D@StringEnd | sub edi esi
If edi < D@PatternSize
xor eax eax | jmp L2>>
End_If
lea eax D$esi+15 ; Prepare next block check
.Loop_Until eax > D@ScanLimit
; Handle tail (remaining bytes < 16)
mov eax D@StringEnd | sub eax esi
If eax => D@PatternSize
call ScanSmallSSE2 esi, eax, D@pSearchPattern, D@PatternSize
Else
xor eax eax
End_If
...End_If
L2:
; Restore XMM registers
movdqu XMM0 X@XMMReg0Dis
movdqu XMM1 X@XMMReg1Dis
movdqu XMM2 X@XMMReg2Dis
movdqu XMM3 X@XMMReg3Dis
EndP
Helper functionsProc ComparePattern16ByteBlocks:
Arguments @pDataBuffer, @pSearchPattern, @PatternSize
Local @Reminder
Uses esi, edx
; Step 1: Check if the length is zero. If so, skip the function.
mov edx D@PatternSize
mov esi D@pDataBuffer
mov edi D@pSearchPattern
; Compute multiples of 16 and remainder
;and edx 0-16 | mov eax edx | xor edx D@PatternSize ; When 0 means lenght is divisible by 16 and we have no remainders.
mov eax edx
and eax 0-16 ; eax = multiples of 16 (aligned down)
and edx (16-1) ; edx = remainder (0-15)
; Process 16-byte blocks with SIMD
.Test_If_Not_Zero eax ; edx = 0 ? no remainders exit
mov D@Reminder edx ; Preserve remainder
; We are working with multiples of 16 bytes, therefore, we can get rid of using shr eax 4 and directly subtract 16 bytes on each loop
; On this way we get a bit more of speed while keeping performance. Interesting is that add eax 0-16 seems a bit faster then sub eax 16
.Do
movdqu xmm0 X$esi ; Load 16 bytes from data
movdqu xmm1 X$edi ; Load 16 bytes from pattern
pcmpeqd xmm0 xmm1 ; Compare 4 dwords (16 bytes) at once
movmskps edx xmm0 ; Extract 4-bit mask (1 per dword)
cmp edx 0F | jnz @NoMatch ; All 4 dwords must match
add esi 16
add edi 16
add eax 0-16 ; subtract by 16 bytes. Faster then sub eax 16
.Repeat_Until_Zero
mov edx D@Reminder ; Restore remainder
.Test_End
; Handle remainder (0-15 bytes, if any). When eax is also 0, we skip this and exit the D@pDataBuffer.
; Why ? Because on the previous test, eax was never 0
.Test_If_Not_Zero edx ; Check if edx is not 0
; Split PatternSize into dwords (dl) and bytes (dh)
mov dh dl
and dl 0-4 ; dl = our Dword computation. So, values multiples of 4 only, on a maximum of 12 ; dl = dwords * 4 (0, 4, 8, 12)
xor dh dl ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3 ; ; dh = bytes (0, 1, 2, 3)
Test_If_Not_Zero dl ; Compare DWORD remainders.
; If dl = 0, so if we have no dwords remainder, jmp over to compute the byte remainders
mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4 | add dl 0-4 | jz @CheckBytes
mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4 | add dl 0-4 | jz @CheckBytes
mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4
Test_End
@CheckBytes:
; check if we have no byte remainders
Test_If_Not_Zero dh ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
mov al B$esi | cmp al B$edi | jnz @NoMatch | dec dh | jz @Match
mov al B$esi+1 | cmp al B$edi+1 | jnz @NoMatch | dec dh | jz @Match
mov al B$esi+2 | cmp al B$edi+2 | jnz @NoMatch
Test_End
.Test_End
@Match:
; If the pattern matches, return the starting position.
mov eax D@pDataBuffer
ExitP
@NoMatch:
; No matches, return 0
xor eax eax
EndP
Proc ComparePatternBytes:
Arguments @pDataBuffer, @pSearchPattern, @PatternSize
Uses esi, edx
; Step 1: Check if the length is zero. If so, skip the function.
mov edx D@PatternSize
mov esi D@pDataBuffer
mov edi D@pSearchPattern
; Split PatternSize into dwords (dl) and bytes (dh)
mov dh dl ; dh = PatternSize copy
and dl 0-4 ; dl = our Dword computation (0, 8, 8, 12). So, values multiples of 4 only, on a maximum of 12
xor dh dl ; dh = our byte computation (0, 1, 2, 3). So, values are multples of 1 only, on a maximum of 3
Test_If_Not_Zero dl ; Compare DWORD remainders. (up to 12 bytes)
; If dl = 0, so if we have no dwords remainder, jmp over to compute the byte remainders
mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4 | add dl 0-4 | jz @CheckBytes
mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4 | add dl 0-4 | jz @CheckBytes
mov eax D$esi | cmp eax D$edi | jnz @NoMatch | add esi 4 | add edi 4
Test_End
@CheckBytes:
; check if we have no byte remainders (0-3)
Test_If_Not_Zero dh ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
mov al B$esi | cmp al B$edi | jnz @NoMatch | dec dh | jz @Match
mov al B$esi+1 | cmp al B$edi+1 | jnz @NoMatch | dec dh | jz @Match
mov al B$esi+2 | cmp al B$edi+2 | jnz @NoMatch
Test_End
@Match:
; If the pattern matches, return the starting position.
mov eax D@pDataBuffer
ExitP
@NoMatch:
; No matches, return 0
xor eax eax
EndP
Proc ScanSmallSSE2:
Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
Local @StringEnd, @FirstBitPos, @ScanLimit, @pFunction
; Early exit for tiny datasets (<= 8 bytes)
..If D@DataSize <= 8
call CompareSmallDataBlock esi, D@DataSize, D@pSearchPattern, D@PatternSize
..Else
; Validate inputs
mov esi D@pDataBuffer
mov edi D@DataSize | lea eax D$esi+edi | mov D@StringEnd eax
mov ebx D@PatternSize
; Calculate ScanLimit
mov eax edi ; eax = DataSize
and eax 0-8 ; eax = DataSize & 0xFFFFFFF0
lea eax D$esi+eax+1 ; eax = esi + (DataSize & 0xFFFFFFF0) + 1
mov D@ScanLimit eax ; Store result in ScanLimit
; Select comparison function based on pattern size. For situations where pattern size is betwen 16 and 31
mov D@pFunction ComparePatternBytes ; Default for PatternSize <= 15
If ebx >= 16
mov D@pFunction ComparePattern16ByteBlocks ; For PatternSize 16-31
End_If
; Main loop: Process the dataset in blocks of 16 bytes. ebx = PatternSize. The register must not be changed
.Do
mov D@FirstBitPos 0
movq XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
movq XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
pcmpeqb xmm0 xmm2 ; Compare the first byte of the pattern with the first byte of the block.
pcmpeqb xmm1 xmm3 ; Compare the last byte of the pattern with the last byte of the block.
; Combine the masks to find positions where both first and last bytes match.
pand xmm1 xmm0 ; xmm1 = combined mask (Var_EqFirst AND Var_EqLast).
; Generate a bitmask from the combined mask.
pmovmskb edx XMM1 ; ebx = bitmask of matching positions. Maximum value = 0FFFF
; Check if there are any matches.
.Test_If_Not_Zero edx
; Process each matching position.
Do
; Find the first set bit in the bitmask.
GET_FIRST_BIT_SET edx | mov D@FirstBitPos eax
; Calculate the address of the matching position.
lea edi D$esi+eax
; Perform a detailed comparison of the pattern at this position.
call D@pFunction edi, D@pSearchPattern, D@PatternSize
cmp eax 0 | jnz L2>> ; If a match is found, return the position.
; Clear the leftmost set bit in the bitmask and save the result in dx.
CLEAR_LEFTMOST_SET16 dx
Repeat_Until_Zero ; perfom a loop if dx is not zero.
; Update the position to continue searching.
mov eax D@FirstBitPos | add eax D@PatternSize
.Test_Else
; If no matches, find the last set bit in the mask and advance the pointer.
pmovmskb eax xmm0 ; eax = bitmask of first byte matches. Maximum value = 0FFFF
Test_If_Not_Zero eax
; Find the last set bit.
GET_LAST_BIT_SET eax
inc eax ; Advance to the next position.
; We found at least 1 byte matching, increase the position of the last byte that matched by 1
Test_Else
mov eax 8 ; No matches, advance by 8 bytes.
Test_End
.Test_End
add esi eax ; Advance the dataset pointer.
; Check remaining size
mov eax D@StringEnd | sub eax esi
If eax < D@PatternSize
xor eax eax | ExitP
End_If
lea eax D$esi+8
.Loop_Until eax > D@ScanLimit
; Handle remaining tail
mov eax D@StringEnd | sub eax esi
If eax => D@PatternSize
call CompareSmallDataBlock esi, eax, D@pSearchPattern, D@PatternSize
Else
xor eax eax
End_If
..End_If
L2:
EndP
Proc CompareSmallDataBlock:
Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
Local @EndString
Uses esi, edi, edx, ecx
; Initialize pointers.
mov esi D@pDataBuffer
mov eax D@DataSize
; EndString = pDataBuffer + DataSize - PatternSize
lea eax D$esi+eax | sub eax D@PatternSize | mov D@EndString eax
mov edi D@pSearchPattern
.Do
; Compare the pattern byte-by-byte.
mov ecx D@PatternSize
xor eax eax
Do
movzx edx B$esi+eax | cmp dl B$edi+eax | jne L1>
inc eax
dec ecx
Repeat_Until_Zero
; If the pattern matches, return the starting position.
mov eax esi
ExitP
L1:
; Move to the next position in the dataset.
inc esi
.Loop_Until esi > D@EndString
; If no match is found, return 0.
xor eax eax
EndP
Macros used:[CLEAR_LEFTMOST_SET16 | mov ax #1 | add #1 0-1 | and #1 ax]
[GET_FIRST_BIT_SET | bsf eax #1]
[GET_LAST_BIT_SET | bsr eax #1]
[SSE2_EXPAND_BYTE | movd #1 #2 | punpcklbw #1 #1 | punpcklwd #1 #1 | pshufd #1 #1 0]
Masm translation (I hope it is ok, but if not, i included the disassembled source that works - just rename xmmword with oword or other compatible masm token for SSE registers)
New Files Atacched (24/03/2025) with comments (See RosAsmCode_Info.txt)
Bug fixes:- Fixed an error in one of the helper functions that was causing the pattern to not found properly.
- Fixed security to guarantee that the data stays within the limits of a SSE2 register (16 bytes per block)
- Improved speed for Match and Non-Match cases. Now both are faster
- Added more helpers to handle patterns and data sizes smaller than 32 and 16 bytes to gain performance in small data chunks.
- Added comments on all functions.
Note: For SSE2 with x86, i guess i reached my limit on speed the algorithm. In theory, it could be improved a little bit forcing the algo to use the others registers (xmm4 to xmm7) to also look for the 2nd and penultimate byte from the pattern, or even creating now a flag to handle when the pattern is only 2 bytes, so it won´t enter the core function since it already found the matching pair etc etc, but it will result on a more complex code and i´m not sureif the performance will vary that much.
In any case, if you want to test to see if it is possible to optimize it even further (obbeying the limits etc), you can see an alternative solution here at post #31 (https://masm32.com/board/index.php?topic=12630.30)
Hi guga!
Too many examples in the test. The entirety of the results was greater than the 20000 character limit per post, so I could not post it all.
Here is the Min/Max results:
Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz 3.60 GHz
_______________________________________________
Match Min Max Timmings:
_______________________________________________
Slowest Match cases
Clock cycles: 94 clocks
Example Case: 160
Fastest Match cases
Clock cycles: 38 clocks
Example Case: 3
_______________________________________________
Not Found Min Max Timmings:
_______________________________________________
Slowest Match cases
Clock cycles: 55 clocks
Example Case: 135
Fastest Match cases
Clock cycles: 19 clocks
Example Case: 48
Maybe use less number of examples, so all of the results can be posted?
Additionally...
If 'CharToFind' length is greater than the 'TestingString' length, the test should not even be run.
You should test the string lengths against each other before performing the tests, imo.
Then if the 'CharToFind' length is greater then the 'TestingString' length display error message?
Example:
Clock cycles: 19 clocks
TestingString: abcdef1234567890pqrstuvabcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890
TestingString Size: 145
CharToFind: abcdef1234567890pqrstuvabcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890abcdefgxyz789abcd1234567890gugaxy789abcdefxyz789abcd1234567890abcdefgxyz789abcd1234567890
CharToFind Size: 207
Found at: Pos (0) - Not found
Obviously the 'CharToFind' will never be found, it is longer than the 'TestingString'. :smiley:
Also... 'CharToFind' maybe should be named 'StringToFind' instead?
Hi Zedd
Many Tks
I´ll try using less examples. I just created several of them with different lenght with odd, even and prime numbers sizes to see if the function will fail.
About the cases where the pattern is smaller than the Dataset, the function already checks this at the very beginning, i just included on the tests to see if it was not bypassing the security routines i settled for that. Those situations are checked twice inside the function.
1st - When the data input is smaller than the pattern and the function immediately exists
2nd - If the Dataset is bigger than the pattern, but on a specific point the remainder bytes to be scanned are smaller than the pattern. (Of course, after the previous bytes where correctly checked as well).
I´m optimizing it a bit more and suceedd to remove some uneeded checks in one of it´s internal functions. I´ll try one more thing and check if the results are being returned without any mismatch before converting it to masm and post it here for us. If this small fix was succeeded, i did gained something around 20% to 40% of speed
_______________________________________________
Match Min Max Timmings:
_______________________________________________
Slowest Match cases
Clock cycles: 68 clocks
Example Case: 47
Fastest Match cases
Clock cycles: 42 clocks
Example Case: 2
_______________________________________________
Not Found Min Max Timmings:
_______________________________________________
Slowest Match cases
Clock cycles: 48 clocks
Example Case: 135
Fastest Match cases
Clock cycles: 18 clocks
Example Case: 48
Press enter to exit...
About changing the label name to StringToFind. Sure, it was just to display something on screen to identify what was being done. I´ll rename it on the next tests before convert this function to masm
Okay, guga. I just thought I would mention those, just in case that they might have been overlooked or missed...
I'll look for the next test when it becomes available.
Ok, guys, new test. This time, on the print, i excluded on the display the cases where the mismatch is obvious (Pattern bigger than dataset). Of course, it didn´t affected the function at all...i just removed it from the print.
On this way we can check if the results are ok, since we have less data on the print to check. Also, i changed a bit the function and made it a bit faster. I hope this gain of speed is really due to performance and elimination of redundancies of the code and not due to mismatch on the result that could affect the speed.
If the results of this test are ok, i´ll port it to masm. Pls let me know so i can start porting it and inserting the proper comments on the function.
Btw...about the name of the app "TestingResultDeletar2a", well..it was just a dumy name for this test. I was about to delete it (that´s why i named it as "deletar" - portuguese for "to delete") if someting was wrong, but since i didn´t find wrong results so far, i kept this dummy name.
Btw..the function was created to respect the size limits of a x86 SSE2 register, so the goal was not to extrapolate it ever. On my tests, it seems to work as expected. If it really correct, then when i port it to masm, you can check for yourselfs too if any other error case do exists on larger datasets.
Again the results exceed the 20000 chacter limit for posting it, in its entirety into a post. Full results attached in zip file
Min Max:
Match Min Max Timmings:
_______________________________________________
Slowest Match cases
Clock cycles: 65 clocks
Example Case: 47
Fastest Match cases
Clock cycles: 39 clocks
Example Case: 23
_______________________________________________
Not Found Min Max Timmings:
_______________________________________________
Slowest Match cases
Clock cycles: 55 clocks
Example Case: 135
Fastest Match cases
Clock cycles: 31 clocks
Example Case: 19
Press enter to exit...
Hi Zedd
tks, but the result is empty. I guess you forgot to paste the result on the text file.
But..did you checked to see if we have any mismatch ? I mean if the function failed to find the proper patterns ?
Quote from: guga on March 18, 2025, 10:50:09 AMtks, but the result is empty. I guess you forgot to paste the result on the text file.
And I forgot to test the zip file. That was indeed my fault, sorry.
I have corrected that and reattached it above, where it was originally.
No, I have not inspected that the results were accurate that will take a little time... I'll have to get back to you in a little while.
I think I found a couple errors in the position where string was found.
I'll make a report for each case where an error found....
It'll take some time.
Tks, Btw...about the position...its´rather an index than the position. For example, on the 1st case
TestingString: 578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghdmerdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj
TestingString Size: 90
CharToFind: merdalgjg1ghdlgj8a8ahj1g1
CharToFind Size: 25
Found at: Pos (48) - merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj
The value of 48 means we have 48 bytes before the match. Therefore the pattern was located at the 49th byte. If needed i´ll redo the printing to reflect the exact position where the pattern was found.
"578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghd" = 48 bytes (Pos)
found it at the 49th byte: merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj
When i did the printf i simply calculated the returned address in eax and subtracted from the beginning of the data.
esi = Initial TestingString address
eax = Pattern address found
Pos = eax - esi
Okay, I will reexamine the results. I could have made a mistake too, when I looked the first time.
I am going to be away from computer for some time. It's beautiful outside, I keep going back in to the computer, and back outside...
Give me about an hour to finish checking each result properly. :smiley:
So I can enjoy the weather a little longer. :azn:
I came back inside early. I didn't want to make you wait too long... :smiley:
Okay. Indeed I did make an error when saying that "there was an error in the position where the string is found".
My mistake. The couple errors that I thought I found were the second or more occurrence of the string, not the first. The position listed is for the first occurrence, even if multiple occurrences are there.
I tested the first 40 Examples out of the 167 Examples tested. Almost 25%
The results of 38 of those are all correct as well as the positions of where that first occurence of each string is found, the 2 Examples where the string is not found the 'not found' result is also correct (Example 19 and Example 25).
I didn't go any further as I think I will not find any errors. It does indeed appear that there are no errors in the results. :thumbsup:
HI guys, i updated the 1st post and included the translation to masm of the last working version. I attached the file that i had to disassemble (as did before in other threads) which is more guaranteed to work (Just rename the xmmword to oword if needed). Additionally, i added on the post itself my attempt to port it to masm compatible syntax as well masm macros (But, i dont know if i translated correctly)
Btw, due to the limitations of characters lenght on the forum, i had to remove all the comments on the 1st post., but if needed here are the comments of the functions themselves
;;
SIMD_BinaryScan
This function performs a binary search for a chunk of data (pSearchPattern) within a larger dataset (pDataBuffer).
It uses SSE2 instructions for optimization and speed, especially with large datasets.
The function can be used to find data of any size, including strings (case-sensitive) within a larger text.
Parameters:
pDataBuffer (in): Pointer to the start of the larger dataset in which the search is performed.
This is the buffer in which you are searching.
DataSize (in): The size (in bytes) of the dataset. The size must be greater than or equal to 16 bytes.
pSearchPattern (in): Pointer to the start of the search pattern (the data chunk to search for within the dataset).
This is the smaller data chunk you are searching for.
PatternSize (in): The size (in bytes) of the search pattern.
Return Value:
Success: If the pattern is found, the function returns a pointer to the starting position of the matching chunk in the dataset.
Failure: Returns 0 (&NULL) if the pattern is not found in the dataset.
;;
;;
CompareSmallDataBlock
This function performs a byte-by-byte comparison of a small dataset (<= 16 bytes) with a search pattern.
It is used as a fallback when the dataset is too small for SIMD optimizations.
Parameters:
pDataBuffer (in): Pointer to the start of the dataset.
DataSize (in): The size (in bytes) of the dataset.
pSearchPattern (in): Pointer to the start of the search pattern.
PatternSize (in): The size (in bytes) of the search pattern.
Return Value:
Success: If the pattern is found, the function returns a pointer to the starting position of the match.
Failure: Returns 0 (&NULL) if the pattern is not found.
;;
;;
VerifyPatternMatch
This function performs a detailed comparison of a search pattern with a dataset block.
It is used after a potential match is found in the main function to confirm the match.
Parameters:
pDataBuffer (in): Pointer to the start of the dataset block.
pSearchPattern (in): Pointer to the start of the search pattern.
PatternSize (in): The size (in bytes) of the search pattern.
Return Value:
Success: If the pattern matches, the function returns a pointer to the starting position of the match.
Failure: Returns 0 (&NULL) if the pattern does not match.
;;
And below is a example (I hope) how to call this function:
; Define data and code sections
.data
; Define example data
haystack db "This is a simple string to search within. While trying to find sinp from sinples string", 0
needle db "search wi", 0
; Variables to store string lengths
haystack_len dd ?
needle_len dd ?
; Variable to store the search result
result dd ?
.code
start:
; Calculate the length of the haystack string
invoke lstrlen, addr haystack
mov haystack_len, eax
; Calculate the length of the needle string
invoke lstrlen, addr needle
mov needle_len, eax
; Call the SIMD_BinaryScan function
invoke SIMD_BinaryScan, addr haystack, haystack_len, addr needle, needle_len
mov result, eax
; Check the result
.if result == 0
; If the result is 0, the pattern was not found.
print "Pattern not found.", 13, 10
.else
; If the result is non-zero, the pattern was found.
print "Pattern found at position: "
print str$(result), 13, 10
.endif
; Terminate the program
invoke ExitProcess, 0
Hi guga,
I got the source code from the masm compatible .zip file to assemble okay. It will take a little time to get it ready to test it.
But....
But upon actually reading post #1 I see that you have a 'fixed' masm version, with proper names, etc. and should be ready to assemble. I did not read the entire post before downloading the .zip file. :tongue:
After copying each masm compatible procedure from post #1 and the macros there to a new file. I had to correct the starting line and ending line for each procedure.
After fixing that minor annoyance and adding some necessities, upon attempting to assemble the file:
Microsoft (R) Macro Assembler Version 14.00.24210.0
Copyright (C) Microsoft Corporation. All rights reserved.
Assembling: C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm
***********
ASCII build
***********
C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(10) too large for specified size
C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(10) : C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(11) too large for specified size
C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(11) : error A¿C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(12) too large for specified size
C:\Users\Administrator\Desktop\SIMD_BinaryScan_masm.asm(12) : fatal 掟睦w_
Assembly Error
Press any key to continue . . .
I don't think ml.exe likes your macros, guga.
The macros and masm compatible procedures from post #1 are in this source code, it would have been easier if they were all in the same code block in post #1:
include \masm32\include\masm32rt.inc
.xmm
SIMD_BinaryScan proto :DWORD, :DWORD, :DWORD, :DWORD
CompareSmallDataBlock proto :DWORD, :DWORD, :DWORD, :DWORD
VerifyPatternMatch proto :DWORD, :DWORD, :DWORD
; ---------------------------------------------------------------------------
if macro condition
.if condition
endif
else macro
.else
endif
elseif macro condition
.elseif condition
endif
endif macro
.endif
endif
; MASM Macros for loops
repeat macro
.repeat
endm
until macro condition
.until condition
endm
; MASM Macros for bit manipulation
bsf macro dest, src
bsf dest, src
endm
bsr macro dest, src
bsr dest, src
endm
; MASM Macros for SSE2 operations
SSE2_EXPAND_BYTE macro xmmreg, byteval
movd xmmreg, byteval
punpcklbw xmmreg, xmmreg
punpcklwd xmmreg, xmmreg
pshufd xmmreg, xmmreg, 0
endm
CLEAR_LEFTMOST_SET16 macro reg
mov ax, reg ; Copia o valor do registrador para AX.
add reg, -1 ; Subtrai 1 do registrador (inverte todos os bits até o primeiro bit definido).
and reg, ax ; Aplica a máscara para limpar o bit mais à esquerda.
endm
GET_FIRST_BIT_SET macro reg
bsf eax, reg ; Encontra o índice do primeiro bit definido em `reg` e armazena em EAX.
endm
GET_LAST_BIT_SET macro reg
bsr eax, reg ; Encontra o índice do último bit definido em `reg` e armazena em EAX.
endm
; ---------------------------------------------------------------------------
.code
; ---------------------------------------------------------------------------
SIMD_BinaryScan proc pDataBuffer:DWORD, DataSize:DWORD, pSearchPattern:DWORD, PatternSize:DWORD
local FirstBitPos:DWORD, ScanLimit:DWORD, StringEnd:DWORD
local XmmPreserve[20]:BYTE
xor eax, eax
mov esi, [DataSize]
if (esi < [PatternSize]) || ([PatternSize] == 0)
mov eax, 0
ret
elseif ([DataSize] <= 16)
invoke CompareSmallDataBlock, [pDataBuffer], [DataSize], [pSearchPattern], [PatternSize]
ret
endif
lea eax, [XmmPreserve]
movdqu [eax], xmm0
movdqu [eax+16], xmm1
movdqu [eax+32], xmm2
movdqu [eax+48], xmm3
mov esi, [pSearchPattern]
mov ebx, [PatternSize]
movzx eax, byte ptr [esi]
SSE2_EXPAND_BYTE xmm2, eax
lea eax, [esi+ebx-1]
movzx eax, byte ptr [eax]
SSE2_EXPAND_BYTE xmm3, eax
mov esi, [pDataBuffer]
mov edi, [DataSize]
mov eax, [PatternSize]
shr eax, 4
add eax, 16
sub edi, eax
if edi > 16
mov edi, [DataSize]
sub edi, [PatternSize]
endif
lea eax, [esi+edi]
mov [ScanLimit], eax
mov eax, [DataSize]
lea eax, [esi+eax]
mov [StringEnd], eax
.repeat
mov [FirstBitPos], 0
movups xmm0, [esi]
mov eax, [PatternSize]
movups xmm1, [esi+eax-1]
pcmpeqb xmm0, xmm2
pcmpeqb xmm1, xmm3
pand xmm1, xmm0
pmovmskb ebx, xmm1
.if ebx != 0
.repeat
GET_FIRST_BIT_SET ebx
mov [FirstBitPos], eax
lea edi, [esi+eax]
invoke VerifyPatternMatch, edi, [pSearchPattern], [PatternSize]
.if eax != 0
jmp L2
.endif
CLEAR_LEFTMOST_SET16 bx
.until bx == 0
mov eax, [FirstBitPos]
add eax, [PatternSize]
.else
pmovmskb eax, xmm0
.if eax != 0
GET_LAST_BIT_SET eax
inc eax
.else
mov eax, 16
.endif
.endif
add esi, eax
.until esi > [ScanLimit]
xor eax, eax
.if [FirstBitPos] == 0
mov eax, [StringEnd]
sub eax, esi
.if [PatternSize] <= eax
invoke CompareSmallDataBlock, esi, eax, [pSearchPattern], [PatternSize]
.else
xor eax, eax
.endif
.endif
L2:
lea esi, [XmmPreserve]
movdqu xmm0, [esi]
movdqu xmm1, [esi+16]
movdqu xmm2, [esi+32]
movdqu xmm3, [esi+48]
ret
SIMD_BinaryScan endp
; ---------------------------------------------------------------------------
CompareSmallDataBlock proc pDataBuffer:DWORD, DataSize:DWORD, pSearchPattern:DWORD, PatternSize:DWORD
local EndString:DWORD
mov esi, [pDataBuffer]
mov eax, [DataSize]
lea eax, [esi+eax]
sub eax, [PatternSize]
mov [EndString], eax
mov edi, [pSearchPattern]
.repeat
mov ecx, [PatternSize]
xor eax, eax
.repeat
movzx edx, byte ptr [esi+eax]
cmp dl, byte ptr [edi+eax]
jne L1
inc eax
dec ecx
.until ecx == 0
mov eax, esi
ret
L1:
inc esi
.until esi > [EndString]
xor eax, eax
ret
CompareSmallDataBlock endp
; ---------------------------------------------------------------------------
VerifyPatternMatch proc pDataBuffer:DWORD, pSearchPattern:DWORD, PatternSize:DWORD
local Reminder:DWORD
mov edx, [PatternSize]
mov esi, [pDataBuffer]
mov edi, [pSearchPattern]
and edx, -16
mov eax, edx
xor edx, [PatternSize]
.if edx != 0
lea esi, [esi+eax]
lea edi, [edi+eax]
mov dh, dl
and dl, -4
xor dh, dl
.if dl != 0
mov eax, [esi]
cmp eax, [edi]
jnz L4
add esi, 4
add edi, 4
add dl, -4
jz L2
mov eax, [esi]
cmp eax, [edi]
jnz L4
add esi, 4
add edi, 4
add dl, -4
jz L2
mov eax, [esi]
cmp eax, [edi]
jnz L4
add esi, 4
add edi, 4
.endif
L2:
.if dh != 0
mov eax, 0
mov dl, byte ptr [esi]
cmp dl, byte ptr [edi]
jnz L4
dec dh
jz L2
mov dl, byte ptr [esi+1]
cmp dl, byte ptr [edi+1]
jnz L4
dec dh
jz L2
mov dl, byte ptr [esi+2]
cmp dl, byte ptr [edi+2]
jnz L4
.endif
mov eax, [pDataBuffer]
.else
mov eax, esi
.endif
L4:
ret
VerifyPatternMatch endp
start:
end start
I can go no further with this.
Hi Larry
Tks a lot
Better would use the disassembled version, until someone port it to the default masm macros set. I gave a try porting it, but i´m not used to masm syntax since years, and had to translate the macros with ChatGPT (In fact, deepseek did the translation).
But, if you are used to masm macros system, it shouldn't be hard to port. RosAsm If macro set works the same as in Masm version the If_Or and If_And macros in RosAsm with maybe similar to If xxx= 5 & yy= 6 or If xxx= 10 OR yyy = 30 (If i remember, masm If macro set allowed something like that, right ? I mean, using multiple If variables on the same line with some boolean &, OR, right ?)
The while and end_while macro in RosAsm, is the same as while endwhile in masm, except it don´t accept multiple conditions.
The do loop_until macro in RosAsm is also similar to while macro in masm, except it don´t have the jmp opcode at the start. so, it works like:
Start:
mov eax 5
dec eax
cmp eax 0 | jne Start
or je, jle etc etc depending teh used RosAsm token, but the functionality is like that.
The uses macro in rosasm is just to preserve registers on a procedure. It´s a simple push reg at the start of a function and correspondant pop reg at the end (i guess masm has also something like that).
In RosAsm to make the source better to be read we can use separators "|" to dislay the code on a same line. THis is unused during compilation time, it´s just a token to make the code more readable, specially if the code have several lines, and we want it do display in one only. So
mov eax 5000 | shr eax 3 | inc eax
is the same as
mov eax 5000
shr eax 3
inc eax
The usage of "," in between the opcodes is optional. So you can also write it as:
mov eax, 5000
shr eax, 3
inc eax
Or any equivalent in Masm. - I personally, don´t use the comma´s too often for short opcodes. I use it more in call function with several parameters. But in RosAsm, it is optional, it won´t affect at all the code since it is handled by the assembler to simply skip those chars on certain conditions.
Now, if the function really work as expected, now it´s time to do the backwards version of it. So, a mirror function to works backwards searching the pattern from end to the beginning of the data chunk. I already did that, but since i updated the forward version, i´ll try to see if i can do the same with the backwards
I'm not very good with 'real' macros, just simple macros.
Using the disassembled version might assemble fine, and actually work, but it would require some modifications regarding all of the 'loc_004xxxxx' labels, etc. and a few other things to make it a proper source code. I don't have a lot of time to do all of that (I have projects of my own to work on), but yes it is possible to do.
Hopefully, someone else here can help you to make a proper Masm32 compatible version of these functions. It looks like a very promising and useful set of functions. :thumbsup:
Cheers...
Larry
Congratulations Guga for SIMD_BinaryScan (equivalent to Instring), :thumbsup:
It's great that you continue to have fun and feed the forum with interesting 32 bit code that our eminent translator zed151 tests and comments on. :badgrin:
I'm also grateful for the IDA translation into MasmCompatible.asm (unfortunately without any comments on the individual lines). :undecided:
For me, this is a perfect example of amateurish attitude towards the Instring function, because:
1. There is no check in the forum or by AI for an existing function written by a professional.
2. Splitting a simple function into three separate functions, etc. :smiley:
I'll stop here and suggest the following:
1. Take a look at the following elegant function
;*******************************************************************;
; Immediate byte constants
;EQUAL_ANY = 0000b
;RANGES = 0100b
;EQUAL_EACH = 1000b
;EQUAL_ORDERED = 1100b
;NEGATIVE_POLARITY = 010000b
;BYTE_MASK = 1000000b
;*******************************************************************************************************;
align 16
db 8 dup (90h)
InstrOg proc
mov rax, rcx ; rcx = haystack, rdx = needle
movdqu xmm2, oword ptr[rdx] ; load 1st 16 bytes of needle
pxor xmm3, xmm3
@Loop: ; find the first possible match of 16-byte fragment in haystack
pcmpistri xmm2, oword ptr[rax], EQUAL_ORDERED
lea rax, [rax+16]
ja @Loop
lea rax, [rax+rcx-16] ; save the possible match start
mov r9, rdx ; r9 = rdx rdx = needle
mov r8, rax
jnc @NotFound
sub r9, rax
@@: ; compare the strings
movdqu xmm1, oword ptr [r8 + r9]
pcmpistrm xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
movdqu xmm4, oword ptr[r8]
pand xmm4, xmm0
pcmpistri xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
lea r8, [r8+16]
ja @b
jnc @Found
add rax,1 ; continue searching from the next byte
jmp @Loop
@NotFound:
xor eax, eax
@Found:
ret
InStrOg endp
;*****************************************************************;
2. If you like it, just request a translation to MASM32 and testing from zed151.
I hope you continue to have fun and feed the forum with interesting code. :thumbsup:
Tks, but Couldn´t insert the comments on each line due to the limitations of the amount of characters on the forum. But, the comments on each line are displayed ion the RosAsm version, so it wouldn´t be hard to follow.
"For me, this is a perfect example of amateurish attitude towards the Instring function, because:
1. There is no check in the forum or by AI for an existing function written by a professional."
About the code...amatory attitude ? Not at all, there are papers describing the same method I used, scanning the data from 16 bytes to 16 bytes. Using 3 functions was mainly to avoid extrapolating the limits of a SSE register and also working on small datasets and small patterns. This function is not yet to be used as a instring on the normal way, since it does not look for null terminated byte, it is used to scan for a chunk of data (pattern) of any size on a dataset also of any size.
Also, it is for 32 bits and not 64 and so far, I´m limited to test with SSE2 mainly.
Your code, although shorter, is for 64 bits, using SSE4, but is limited to 16-byte aligned strings, needles bigger or equal to 16 bytes, null-terminated strings, cases where haystack is large enough, since it seems to not check the size of the pattern or the dataset itself.
(Unless, the code you posted is the core of another function that already check those limitations)
I would like to test it, but I can't port it to 32 bits and it is needed (for now) to be limited to SSE2.
I´ll try to port it to 32 bits using only SSE2 and if succeeded, I´ll post here the results
I gave a test trying to port it to 32 bits, but unsure if it is the proper equivalence. If the code below is the equivalent for your´s 64 bits version, then, it is not working:
Is this the proper porting to 32 bits using only SSE2 ?
Proc InstrOg:
Arguments @pHaystack, @pNeedle
Uses esi, edi, ebx
mov eax D@pHaystack ; eax = haystack
mov esi D@pNeedle
movdqu xmm2 X$esi ; load 1st 16 bytes of needle into xmm2
pxor xmm3 xmm3 ; xmm3 = 0 (used for comparisons)
@MainLoop:
; Compare 16 bytes of haystack with the needle
movdqu xmm1 X$eax ; load 16 bytes of haystack into xmm1
pcmpeqb xmm1 xmm2 ; compare bytes for equality
pmovmskb ecx xmm1 ; create a bitmask of the comparison results
test ecx ecx ; check if any bytes matched
jnz @PossibleMatch ; if any match, jump to PossibleMatch
add eax 16 ; move to the next 16-byte block in haystack
jmp @MainLoop ; continue searching
@PossibleMatch:
; A possible match was found. Now verify the full string.
mov esi eax ; esi = start of the possible match in haystack
mov edi D@pNeedle ; edi = needle
mov ebx 16 ; ebx = number of bytes to compare
@VerifyLoop:
movdqu xmm1 X$esi ; load 16 bytes of haystack into xmm1
movdqu xmm4 X$edi ; load 16 bytes of needle into xmm4
pcmpeqb xmm1 xmm4 ; compare bytes for equality
pmovmskb ecx xmm1 ; create a bitmask of the comparison results
cmp ecx 0FFFF ; check if all 16 bytes matched
jne @NoMatch ; if not, jump to NoMatch
add esi 16 ; move to the next 16-byte block in haystack
add edi 16 ; move to the next 16-byte block in needle
sub ebx 16 ; decrement the remaining bytes to compare
jnz @VerifyLoop ; if more bytes to compare, continue
; Full match found
mov eax esi ; return the start of the match
jmp @Done
@NoMatch:
; No match, continue searching from the next byte
add eax 1 ; move to the next byte in haystack
jmp @MainLoop ; continue searching
@Done:
; Restore registers and return
mov eax esi ; return the start of the match (or 0 if not found)
EndP
[TestingString2: B$ "578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghdmerdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0]
[CharToFind2: B$ "merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0]
call InstrOg TestingString2, CharToFind2
Hi,guga
His codes works fine.
.data
; Define example data
haystack db "578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghdmerdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0
needle db "merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0
.code
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
InStrOg proc plBuf:QWORD,plNed:QWORD
mov rax, rcx ; rcx = haystack, rdx = needle
movdqu xmm2, oword ptr[rdx] ; load 1st 16 bytes of needle
pxor xmm3, xmm3
@Loop: ; find the first possible match of 16-byte fragment in haystack
pcmpistri xmm2, oword ptr[rax], EQUAL_ORDERED
lea rax, [rax+16]
ja @Loop
lea rax, [rax+rcx-16] ; save the possible match start
mov r9, rdx ; r9 = rdx rdx = needle
mov r8, rax
jnc @NotFound
sub r9, rax
@@: ; compare the strings
movdqu xmm1, oword ptr [r8 + r9]
pcmpistrm xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
movdqu xmm4, oword ptr[r8]
pand xmm4, xmm0
pcmpistri xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
lea r8, [r8+16]
ja @b
jnc @Found
add rax,1 ; continue searching from the next byte
jmp @Loop
@NotFound:
xor rax, rax
@Found:
ret
InStrOg endp
...
invoke InStrOg, addr haystack, addr needle
mov result, rax
; Check the result
.if result == 0
; If the result is 0, the pattern was not found.
invoke AddLog,CStr("Pattern not found.",13,10)
.else
; If the result is non-zero, the pattern was found.
invoke RtlZeroMemory,addr szBuf,sizeof szBuf
mov rax,result
lea rcx,haystack
sub rax,rcx
invoke wsprintf,addr szBuf,CStr("Pattern found at position(offset): %d ",13,10),rax
invoke AddLog,addr szBuf
.endif
QuotePattern found at position(offset): 48
Hi Six_L. Tks for testing
So, it was how i translate it to 32 bits that isn't working. I cannot test 64 bits apps in RosAsm right now. If someone could port this to 32 bits and with SSE2 it could be better for test. But, his version limits the size of strings and patterns or extrapolates the boundaries of a SSE register ? Those are the issues i would like to check.
QuoteIf someone could port this to 32 bits and with SSE2 it could be better for test.
And wouldn't it be better if you changed it to:
If someone could port my code to 64 bits it could be better for test. Аnyway... :thumbsup:
Thanks
six_L for your translation, but I don't understand why it includes 2 unnecessary variables that are not used anywhere in the code. I mean
plBuf:QWORD and
plNed:QWORD. :smiley:
Hi,ognil
QuoteI don't understand why it includes 2 unnecessary variables that are not used anywhere in the code. I mean plBuf:QWORD and plNed:QWORD.
I like the macros. Using the macros makes it easier to grasp the correctness of code from a macroscopic angles. if i have not used the "invoke" macros. then the code change the following:
invoke InStrOg, addr haystack, addr needle
mov result, rax
==>
mov rcx,offset haystack
mov rdx,offset needle
call InStrOg
mov result, rax
Regards
six_L
Thanks to six_L for trying to explain the presence of two unused variables in your code.
Unfortunately, this is wrong, because there is nothing different in the result of the two calls you showed, from the debugger's perspective:
invoke InStrOg,addr szText,addr szText31 generate in debugger MS VC:
000000014000113D 48 8D 0D 6C 8D 06 00 lea rcx,[szText (0140069EB0h)]
0000000140001144 48 8D 15 74 8D 06 00 lea rdx,[szText31 (0140069EBFh)]
000000014000114B E8 3D 82 00 00 call InStrOg (014000938Dh)
In other words, the macro does not write the variables plBuf:QWORD and plNed:QWORD into the code!!.
This is done by the programmer additionally, without being necessary in this case.
This is one of the reasons why I do not use macros, because then a mandatory check with the debugger must be done.
/Fl⟦filename⟧ Generates an assembled code listing. See /Sf.
Quote from: ognil on March 20, 2025, 05:40:18 AM2. Quotemov rcx, offset haystack ; mov rdx, offset needle
Тhis may or may not work depending on how big the variable address is in memory.
Isn't this a questionable statement? "MOV reg64,imm" is actually one of the few instructions that allow 64-bit immediates as source operand (so "mov rdx,offset xxx" is supposed to work for ANY 64-bit address). OTOH, "LEA reg64, [mem64]" has a displacement operand of 32-bit only, so it's restricted to either PC +/- 31-bit offset or, for non-PC-relative addressing, to address the first 4 GB only.
Hi,ognil
1,
QuoteIn other words, the macro does not write the variables plBuf:QWORD and plNed:QWORD into the code!!.
This is done by the programmer additionally, without being necessary in this case.
your "InStrOg" PROC have two parameters[in]. i added the variables plBuf:QWORD and plNed:QWORD into the code for receiving the inputted parameters.
"addr haystack" --> "plBuf" --> "rcx"
"addr needle" --> "plNed" --> "rdx"
Another way(remove the variables plBuf:QWORD and plNed:QWORD):
InStrOg proc
mov rax, rcx ; rcx = haystack, rdx = needle
movdqu xmm2, oword ptr[rdx] ; load 1st 16 bytes of needle
pxor xmm3, xmm3
@Loop: ; find the first possible match of 16-byte fragment in haystack
pcmpistri xmm2, oword ptr[rax], EQUAL_ORDERED
lea rax, [rax+16]
ja @Loop
lea rax, [rax+rcx-16] ; save the possible match start
mov r9, rdx ; r9 = rdx rdx = needle
mov r8, rax
jnc @NotFound
sub r9, rax
@@: ; compare the strings
movdqu xmm1, oword ptr [r8 + r9]
pcmpistrm xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
movdqu xmm4, oword ptr[r8]
pand xmm4, xmm0
pcmpistri xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
lea r8, [r8+16]
ja @b
jnc @Found
add rax,1 ; continue searching from the next byte
jmp @Loop
@NotFound:
xor rax, rax
@Found:
ret
InStrOg endp
...
; Call the SIMD_BinaryScan function
lea rcx,haystack
lea rdx,needle
invoke InStrOg
mov result, rax
; Check the result
.if result == 0
; If the result is 0, the pattern was not found.
invoke AddLog,CStr("Pattern not found.",13,10)
.else
; If the result is non-zero, the pattern was found.
invoke RtlZeroMemory,addr szBuf,sizeof szBuf
mov rax,result
lea rcx,haystack
sub rax,rcx
invoke wsprintf,addr szBuf,CStr("Pattern found at position(offset): %d ",13,10),rax
invoke AddLog,addr szBuf
.endif
2,
in UASM64
"lea rcx,haystack" == "mov rcx, offset haystack"
Regards
six_L
Thanks guys, :smiley:
I don't want to argue with you because this thread is basically cleared of comments and we should respect the work of others. When you post your own competing code against Guga's code, I would be happy to comment on it in detail. :thumbsup:
New version coming soon. Succeeded to make no errors so far, and kept the speed for both (match and non match cases). Also succeeded (finally) to force the data to always stays within the limits of a SSE registers. On this way we have 0 chance to an error when trying to load data beyond the limits.. Ex:
[DataSet: B$ "Hello world, i´m trying to find a string", 0] ; last byte is "g" and not 0
[Pattern: B$ "string", 0] ; 1st and last chars are s and g, respectively
On this, way, if the loop happens to be in "ind a string" - < pointer in esi is here. Then, when it try to load in xmm0, since this chunk has only 12 bytes long, it cannot load the extra 13 to 16 bytes. (We are not loading only strings, and we have no guarantees that we still have enough memory after the ending "g" to be loaded, since the data can be located at the very end of a block of memory or physical data (ex: we are trying to locate a chunk of bytes on the end of the .data section of a PE or other type of file and the memory ends immediately after the "g")
I succeeded to get no overheads on the algorithm. Now, i´m trying to improve it a little bit, tweaking how it can handle now chunks of data smaller than 31 bytes long, since those are related both as the remainder bytes of a long chain of data and also as small datasets to be scanned.
So, far, in both i´ve got the worst speed of around 90-96 clock cycles on the 167 types of data used to scan and around 40-43 clock cycles the fastest on my AMD.
Slowest Match cases
Clock cycles: 96 clocks
Example Case: 35
Fastest Match cases
Clock cycles: 43 clocks
Example Case: 21
_______________________________________________
Not Found Min Max Timmings:
_______________________________________________
Slowest Match cases
Clock cycles: 73 clocks
Example Case: 19
Fastest Match cases
Clock cycles: 32 clocks
Example Case: 83
Press enter to exit...
I´ll see if the approach i´m planning to handle those data smaller than 32 bytes will have any impact on the performance of the algorithm. I´ll try creating a hybrid algorithm that can handle only this small blocks of data using SSE2 registers and normal x86 registers (for even smaller chunks) to see if it is worthful than using the currently function as (Which is a bit slow, btw):
Proc CompareSmallDataBlock:
Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
Local @EndString
Uses esi, edi, edx, ecx
; Initialize pointers.
mov esi D@pDataBuffer
mov eax D@DataSize
lea eax D$esi+eax | sub eax D@PatternSize | mov D@EndString eax
mov edi D@pSearchPattern
.Do
; Compare the pattern byte-by-byte.
mov ecx D@PatternSize
xor eax eax
Do
movzx edx B$esi+eax | cmp dl B$edi+eax | jne L1>
inc eax
dec ecx
Repeat_Until_Zero
; If the pattern matches, return the starting position.
mov eax esi
ExitP
L1:
; Move to the next position in the dataset.
inc esi
.Loop_Until esi > D@EndString
; If no match is found, return 0.
xor eax eax
EndP
Note: I tested the suggested routines described in several papers (Link1 (https://github.com/WojciechMula/sse4-strstr/blob/master/avx2-strstr-v2.cpp#L107) , link2 (http://0x80.pl/notesen/2016-11-28-simd-strfind.html#generic-sse-avx2) etc etc), but all of them failed in terms of performance, speed and reliability, since they all extrapolated the boundaries of SSE2 registers. Also in some of those papers they suggested that we could ignore the comparison of the 1st and last char before it pass to the core compare function, but this is useless. No actual gain of speed whatsoever. Forcing it to recognize the size of the dataset to just advance 1 bytes and reduce the size in 2, so far caused no impact. IN fact, it leaded the algo to be a bit slower in 3 or 4 clock cycles.
The technique described in those papers are similar to the one i created, except to the fact that my algo respects the limits for a SSE2 register and also is automatically adjustable to any size of data and any size of pattern. The problem with many of those algorithms i tested so far (the one in github as well), is that when it enter on the loop to compare the possible matches, it does not calculate only the matching pairs. Instead, it perform more loops than necessary, and also don´t check the true limit of the dataset , which is not only the cause of the limit extrapolation but also make the functions to be slower because it continues to loop until it reaches the true end of the data, rather than estimate the maximum limit to be performed. For, instance, if the pattern is 10 bytes long, and after all the loops, the ending pointer still needs to check for some remainders. The algorithm must terminate if the rest of the data to be scanned is smaller than the pattern. I.e: say we reach the limit for the remainders, and there´s still 5 bytes to the real end of the dataset. If the pointer is located somewhere between the 10th byte of true end of the data and the limit, then there´s no room to scan, because we cannot have any pattern beyond that. Ex: esi ends in 10 but the true data ending is located in 17. The pattern size = 10 bytes, but, if we are at the end of the limit, the amount of bytes to the end is only 7 bytes, so smaller then the patternsize itself.
Attached the new benchmark version to test.
Hi Guys
i updated the algorithm on the 1st post. Also inserted comments and the correspondent masm version to be tested (just replace the xmmword with oword - or equivalent for SSE2 in masm).
As reported on the 1st post Link (https://masm32.com/board/index.php?topic=12630.0), if anyone else wants to give a try on this algo, or improve it for SSE2 obeying the limits of a xmm register (if possible), here are a couple of things that may be done:
1 - Create a flag before the loop for the cases where the pattern is only 2 bytes. True = Match Pairs ok. False = Not yet, the pattern is bigger than 2 bytes. This flag can be accessed right after the .Test_If_Not_Zero edx, something like.:
(...)
.Test_If_Not_Zero edx
Do
; Find the first set bit in the bitmask.
GET_FIRST_BIT_SET edx | mov D@FirstBitPos eax
cmp D@PatternSize 2 | jz L2>> So, since the pattern is only 2 bytes and we already found them from the SIMD computation, there´s no need to go further onto the call to the helper functions.
2 - make usage of xmm4 and xmm5 to store the 2nd and penultimate byte from the pattern and extend them to the correspondent SSE registers. Then on the SImD comparison , use registers xmm6 and xmm7 to store them and also be compared as we did for xmm0 and xmm1. On this way, we will compare at once, 4 bytes from the pattern instead of 2. And then simply add another pand between the resultant pairs and create the mask on edx as we did in pmovmskb edx XMM1. The problem is that, if you use the 2nd and penultimate byte, you also needs to adjust the pointer to esi to they points to the proper places, something like:
movq XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
movq XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
movq XMM6 X$esi+1 ; Load the 2nd 16 bytes of the current block into xmm0. Var_BlockFirst.
movq XMM7 X$esi+ebx-2 ; Load the penultimate 16 bytes of the current block into xmm1. Var_BlockLast.
The main problem will relies in obeying the boundaries of the registers to avoid it tries to point to addresses outside the limits of the data. So, you most likely will need to use another x86 register to adjust those pointers dynamically. Something like:
movq XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
movq XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
mov eax esi | add eax D@SecondByteFlag
movq XMM6 X$eax ; Load the 2nd 16 bytes of the current block into xmm6. Var_BlocSecond.
mov eax esi | sub eax D@PenultimateByteFlag
movq XMM7 X$eax+ebx ; Load the penultimate 16 bytes of the current block into xmm7. Var_BlockPenultimate.
Here on this example, those flags represents the amount of bytes to advance or subtract according to the size of the pattern. This must be done before the loop as well. Something like:
- If pattern = 2 bytes, D@SecondByteFlag = 0, D@PenultimateByteFlag = 1 ; always starts with 1 subtraction
- If pattern = 3 bytes, D@SecondByteFlag = 1, D@PenultimateByteFlag = 1 ; Pattern is odd, so we need only to point the last byte to itself (-1 as above)
- If pattern >= 4 bytes, D@SecondByteFlag = 1, D@PenultimateByteFlag = 2 ; Pattern is bigger or equal to4 bytes (odd or even, it won´t matter), so we need only to compare the 2 pairs of the start of the pattern and 2 pairs at the end, regardless the size of the pattern.
In theory, this could work and force the algorithm to advance 16 bytes more often making it be a bit faster, since the chances of 4 matching pairs at those same positions are low in real life applications, in most of the cases (Specially if the data set is composed of random bytes).
The only problem i see in adding those 2 more techniques is if that will really increase more the speed or we just added overheads to the algo, making it slower.
Results from running executable in "BinScan24Mar2025.zip" attached as a .txt file (not zipped)
Tks a lot, Zedd :thumbsup: :thumbsup: :thumbsup:
It seems to be working as expected now. Faster and handling all the data. I´m currently working on the reversed function now. I thought it was easier to develop the exact mirror of SIMD_BinaryScan, since i succeeded to fix the original version, but it´s a bit hard to make it stays in the limits when working backwards. I´m trying to see why this is happening, and when succeed, i´ll post here also the backwards function.
Those functions to properly scan data (forward or backward) are the ones i need to fix soon because i´m trying to use them not only inside RosAsm, but making them in general usage as well. Im currently trying to fix several old functions in RosAsm to release a newer version (If i succeed to make the functions works more independently from each other and create the proper dlls for that). I need to fix RosAsm and make it easier to handle the disassembler, debugger, decoder and encoder and also create the proper internal tables to handle the necessary data. On this way, it could be easier to someday - eventually - i decide to create a 64bit version for RosAsm as well. Personally, i´m not a big fan of 64 bits apps, but they are in fact needed nowadays, specially when developing plugins or small apps to work with image, audio and video processing. But, making a 64 bit version of RosAsm is impossible to port from the current Source code - That´s why i´m trying to fix it.