Hi JJ
how do i use Instring ?
I´m having troubles trying to use in order to see if i can port it. Do you have the plain masm code for it ?
I´m trying to make a simple string search inside a text like this:
[INSTR_CASE 0] ; Is this the proper equate/flag ?
[INSTR_NO_CASE 2] ; Is this the proper equate/flag ?
[INSTR_INTELLISENSE 64] ; Is this the proper equate/flag ?
[BigString: B$ "Hello, my name is guga and i want to buy a car", 0]
[SmallString: B$ "guga", 0]
call InStringSSE2 BigString, SmallString, INSTR_NO_CASE
The goal is to search for the string "guga" (case sensitive,. case insensitive and intellisense) inside the main bigger string "Hello, my name is guga and i want to buy a car" and return the position (or address which would be better) where the string was found
What is the proper syntax for it ? I tested with masmbasic, but without any suces
Instr_(...) (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1153):
mov pos, Instr_(1, L$(n), "equ", 1+4) ; 1=start pos, 1=case-insensitive + 4=full word
Rem - returns relative pos in edx, absolute in eax
- if no match is found, zero is returned in edx, and eax points to the start of the 'haystack'; same if 'needle' is empty
- six syntax variants allowed:
A: has startpos: Instr_(1, "Test", "Te", 2) ; 4 args
B: no startpos: Instr_("Test", "Te", 2) ; 3 args, last one immediate = mode
C: has startpos: Instr_(1, "Test", "Te") ; 3 args, last one not immediate
D: no startpos: Instr_("Test", "Te") ; 2 args
) ; 2 args
E: test several patterns: InstrOr("Test", "Te" or "st", 1) ; 3 args (startpos is 1)
F: extra fast mode: Instr_(FAST, My$(ecx), "Hello", 0) ; 4 args, no startpos, modes 0+2 only
- case & mode (bitwise flag):
0=case-sensitive, +1=insensitive, +2=intellisense (Name=name, i.e. case of first char ignored),
+4=full word search, +8=include start of line in text block search, +4+16=not full word
InString is a Masm32 SDK algo, also used with the Masm32 find$(big$, small$) macro.
InStr_() is a MasmBasic macro.
Full test:
include \masm32\MasmBasic\MasmBasic.inc
SetGlobals BigString, SmallString
Init
Let BigString="Hello, my name is guga and i want to buy a car"
Let SmallString="guga"
.if Instr_(BigString, SmallString)
xchg eax, ecx
Inkey "[", Left$(ecx, Len(SmallString)), "]"
.endif
EndOfCode
Shorter:
include \masm32\MasmBasic\MasmBasic.inc
SetGlobals Big$="Hello, my name is guga and i want to buy a car", Small$="guga"
Init
.if Instr_(Big$, Small$)
xchg eax, ecx
Inkey "[", Left$(ecx, Len(Small$)), "]"
.endif
EndOfCode
The xchg eax, ecx is needed because Len() destroys resp. returns eax.
Hello sir guga;
Search by Boyer Moore into this board. Need some optimizations link bellow:
https://masm32.com/board/index.php?msg=109922
Sir jj2007 have done some tests, you can check results from that post to front.
Quote from: mineiro on August 04, 2023, 08:21:15 AMHello sir guga;
Search by Boyer Moore into this board. Need some optimizations link bellow:
https://masm32.com/board/index.php?msg=109922
Sir jj2007 have done some tests, you can check results from that post to front.
Timings are here (https://masm32.com/board/index.php?topic=10040.msg109942#msg109942). Example:
... using a 6 bytes substring [Doplic]
599 ms for CRT strstr result=105560497
563 ms for M32 InString result=105560497
181 ms for BMHBinsearch result=105560497
158 ms for BMBinSearch result=0
420 ms for Boyer-M Mineiro 1 result=105560497
417 ms for Boyer-M Mineiro 2 result=105560497
61 ms for MB Instr_() result=105560497
.const
EQUAL_ANY equ 0000b
RANGES equ 0100b
EQUAL_EACH equ 1000b
EQUAL_ORDERED equ 1100b
NEGATIVE_POLARITY equ 010000b
BYTE_MASK equ 1000000b
.code
align 16
InstrLingo proc haystack:DWORD,needle:DWORD
mov ecx, haystack ; ecx = haystack, edx = needle
push esi
mov edx, needle
push edi
mov eax, ecx
movdqu xmm2, oword ptr[edx] ; load 1st 16 bytes of needle
pxor xmm3, xmm3
@Loop: ; find the first possible match of 16-byte fragment in haystack
pcmpistri xmm2, oword ptr[eax], EQUAL_ORDERED
lea eax, [eax+16]
ja @Loop
lea eax, [eax+ecx-16] ; save the possible match start
mov edi, edx
mov esi, eax
jnc @NotFound
sub edi, eax
@@: ; compare the strings
movdqu xmm1, oword ptr [esi + edi]
pcmpistrm xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
movdqu xmm4, oword ptr[esi]
pand xmm4, xmm0
pcmpistri xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
lea esi, [esi+16]
ja @b
jnc @Found
add eax, 1 ; continue searching from the next byte
jmp @Loop
@NotFound:
xor eax, eax
@Found:
pop edi
pop esi
ret
InstrLingo endp
:skrewy: :biggrin:
Hi Lingo,
Not surprisingly (https://masm32.com/board/index.php?topic=9237.msg117041#msg117041), your code performs exactly like MasmBasic Instr_(), but how come it returns a different position? Did you forget to subtract the start position?
Intel(R) Core(TM) i5-2450M CPU @ 2.50GHz
Finding a string 5 times in a 100MB 'haystack' from\Masm32\include\Windows.inc
... using a 8 bytes substring [Doplicat]
603 ms for CRT strstr result=105560497
586 ms for M32 InString result=105560497
162 ms for BMHBinsearch result=105560497
134 ms for BMBinSearch result=0
323 ms for Boyer-M Mineiro 1 result=0
324 ms for Boyer-M Mineiro 2 result=0
60 ms for Lingo result=143964624
63 ms for MB Instr_() result=105560497
... using a 7 bytes substring [Doplica]
624 ms for CRT strstr result=105560497
588 ms for M32 InString result=105560497
172 ms for BMHBinsearch result=105560497
146 ms for BMBinSearch result=105560497
374 ms for Boyer-M Mineiro 1 result=105560497
369 ms for Boyer-M Mineiro 2 result=105560497
62 ms for Lingo result=143964624
63 ms for MB Instr_() result=105560497
... using a 6 bytes substring [Doplic]
610 ms for CRT strstr result=105560497
581 ms for M32 InString result=105560497
186 ms for BMHBinsearch result=105560497
162 ms for BMBinSearch result=0
423 ms for Boyer-M Mineiro 1 result=105560497
427 ms for Boyer-M Mineiro 2 result=105560497
63 ms for Lingo result=143964624
62 ms for MB Instr_() result=105560497
Hi lingo
Thanks for this algo, it works perfectly.
invoke InstrLingo, addr longString, addr shortString
lea ecx, longString ; eax returns Address of shortString
sub eax, ecx
mov BytePosition, eax ; now eax = Position of shortString
Awrsome. :thumbsup:
Quote from: Caché GB on August 05, 2023, 10:10:20 PMHi lingo
Thanks for this algo, it works perfectly.
It was added to MasmBasic Instr_() (https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1153) over a year ago. Credits should go to Peter Kankowski (https://www.strchr.com/strcmp_and_strlen_using_sse_4.2) (13 years ago):
strstr_sse42:
; ecx = haystack, edx = needle
push esi
push edi
MovDqU xmm2, dqword[edx] ; load the first 16 bytes of neddle
Pxor xmm3, xmm3
lea eax, [ecx - 16]
; find the first possible match of 16-byte fragment in haystack
STRSTR_MAIN_LOOP:
add eax, 16
PcmpIstrI xmm2, dqword[eax], EQUAL_ORDERED
ja STRSTR_MAIN_LOOP
jnc STRSTR_NOT_FOUND
add eax, ecx ; save the possible match start
mov edi, edx
mov esi, eax
sub edi, esi
sub esi, 16
; compare the strings
@@:
add esi, 16
MovDqU xmm1, dqword[esi + edi]
; mask out invalid bytes in the haystack
PcmpIstrM xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK
MovDqU xmm4, dqword[esi]
PAnd xmm4, xmm0
PcmpIstrI xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
ja @B
jnc STRSTR_FOUND
; continue searching from the next byte
sub eax, 15
jmp STRSTR_MAIN_LOOP
STRSTR_NOT_FOUND:
xor eax, eax
STRSTR_FOUND:
pop edi
pop esi
ret
Compare that to Lingo's code:
InstrLingo proc haystack:DWORD,needle:DWORD
mov ecx, haystack ; ecx = haystack, edx = needle
push esi
mov edx, needle
push edi
mov eax, ecx
movdqu xmm2, oword ptr[edx] ; load 1st 16 bytes of needle
pxor xmm3, xmm3
@Loop: ; find the first possible match of 16-byte fragment in haystack
pcmpistri xmm2, oword ptr[eax], EQUAL_ORDERED
lea eax, [eax+16]
ja @Loop
lea eax, [eax+ecx-16] ; save the possible match start
mov edi, edx
mov esi, eax
jnc @NotFound
sub edi, eax
@@: ; compare the strings
movdqu xmm1, oword ptr [esi + edi]
pcmpistrm xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
movdqu xmm4, oword ptr[esi]
pand xmm4, xmm0
pcmpistri xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
lea esi, [esi+16]
ja @b
jnc @Found
add eax, 1 ; continue searching from the next byte
jmp @Loop
@NotFound:
xor eax, eax
@Found:
pop edi
pop esi
ret
InstrLingo endp
MasmBasic's Instr_() has minor optimisations for speed and size but is basically the same.
Thank you JJ,
My algo works without error.
Tested it with similar texts from the attachment. :biggrin:
Hi
The code is fast but potentially unsafe.
If the fetched memory blocks cross a page boundary to an uncommited page, an exception is raised.
The only way to use this code safely is to align the strings to 16.
Biterider
Quote from: lingo on August 05, 2023, 11:24:15 PMMy algo works without error.
If you subtract the source, as noted by myself and Caché GB, Kankowski's algo works fine indeed.
624 ms for CRT strstr result=105560497
588 ms for M32 InString result=105560497
172 ms for BMHBinsearch result=105560497
146 ms for BMBinSearch result=105560497
374 ms for Boyer-M Mineiro 1 result=105560497
369 ms for Boyer-M Mineiro 2 result=105560497
62 ms for Lingo result=143964624
63 ms for MB Instr_() result=105560497Quote from: Biterider on August 05, 2023, 11:33:37 PMThe code is fast but potentially unsafe.
If the fetched memory blocks cross a page boundary to an uncommited page, an exception is raised.
The only way to use this code safely is to align the strings to 16.
Biterider
I had that suspicion, too. Maybe we need a test case...
Quote from: jj2007 on August 05, 2023, 11:48:52 PMI had that suspicion, too. Maybe we need a test case...
Searching a bit I found some discussions about this issue, one of them is
https://www.purebasic.fr/english/viewtopic.php?p=605105 (https://www.purebasic.fr/english/viewtopic.php?p=605105)
One possible solution is to align the pointer for the retrieved characters and invalidate the first characters that are not part of the string. Unfortunately, masking is not possible with this instruction, but the invalid characters can be replaced with non-zero characters (byte/word), e.g. "01", that are not in the string to be searched.
This means a little more execution time, but then the function could be used universally.
Biterider
Hi
When trying to implement something to solve the pafe boundary problem, I quickly discovered that the "needle" has exactly the same problem, which increases the complexity considerably.
Atm I can't see a simple solution for both issues that are efficient enough. :nie:
I wonder sometimes what the developers of such instructions have in mind when it comes to implementing the code... :eusa_pray:
Biterider
Quote from: Biterider on August 06, 2023, 07:14:52 PMAtm I can't see a simple solution for both issues that are efficient enough. :nie:
Test case:
invoke VirtualAlloc, 0, 4096, MEM_COMMIT, PAGE_READWRITE
xchg eax, edi
invoke lstrcpyn, edi, esi, 4096 ; esi is a pointer e.g. to contents of Windows.inc
Now perform the instr() on edi :cool:
It works fine with SEH, but I agree there is no other efficient solution :sad:
As it stands now, I strongly recommend not using these implementations for general purposes, as this will lead to unexplained and random crashes in customer code, resulting in a sh*storm.
Biterider
Quote from: Biterider on August 06, 2023, 08:52:20 PMAs it stands now, I strongly recommend not using these implementations for general purposes, as this will lead to unexplained and random crashes in customer code, resulting in a sh*storm.
I've used them for over a year in all my code, no crashes. HeapAlloc'ed strings and buffers are not a problem, in contrast to this:
invoke VirtualAlloc, 0, 4096, MEM_COMMIT, PAGE_READWRITE
xchg eax, edi
Let esi=FileRead$("\Masm32\include\Windows.inc")
invoke lstrcpyn, edi, esi, 4096 ; esi is a pointer e.g. to contents of Windows.inc
Let sub2$="ACMFILTERCHOOSEHOOKPROC typ" ; "typ" is ok, "type" fails with an exception
Print Str$("The pos is %i\n", Instr_(edi, sub2$))
The solution is SEH.
Hi JJ
It is only a matter of time before this code bites you. :badgrin:
If the SEH handler has to catch this situation, what should it do? Fall back on the regular implementation?
Thinking about that, you have to protect every single call to these new procedures...
A higher level handler is also possible, but the complexity does not justify it.
Another side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.
Still not a good solution for production code...
Biterider
Quote from: Biterider on August 06, 2023, 10:29:37 PMIf the SEH handler has to catch this situation, what should it do? Fall back on the regular implementation?
Return zero = not found.
QuoteThinking about that, you have to protect every single call to these new procedures...
Once, at program start.
QuoteAnother side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.
Nope, the timings are exactly the same if a match is found. If no match is found
and the SEH is triggered, then you have a design problem.
invoke VirtualAlloc, 0, 4096, MEM_COMMIT, PAGE_READWRITE
xchg eax, edi
Let esi=FileRead$("\Masm32\include\Windows.inc")
invoke lstrcpyn, edi, esi, 4096 ; esi is a pointer e.g. to contents of Windows.inc
Let sub32$="ACMFILTERCHOOSEHOOKPROC type" ; type fails with an exception, "typ" is ok
Let sub31$=Left$(sub32$, 31)
Print Str$("For 31 chars, the pos is %i\n", Instr_(edi, sub31$))
Print Str$("For 32 chars, the pos is %i\n", Instr_(edi, sub32$))
Inkey "That was fun with SEH"
Output:
For 31 chars, the pos is 4058
Caught at pos: $esi def DW
For 32 chars, the pos is 0
That was fun with SEH
Quote from: jj2007 on August 06, 2023, 11:27:25 PMQuoteIf the SEH handler has to catch this situation, what should it do? Fall back on the regular implementation?
Return zero = not found.
It is obviously wrong to always return 0 when you have a match, which leads to wrong results.
Quote from: jj2007 on August 06, 2023, 11:27:25 PMQuoteThinking about that, you have to protect every single call to these new procedures...
Once, at program start.
You are talking about a TL-handler that has to deal with all kinds of exceptions coming from somewhere in your application.
This means that you have to analyse exactly where the exception occurred and what went wrong, and then unwind the whole thing to the offending procedure, which is a leaf procedure.
Quote from: jj2007 on August 06, 2023, 11:27:25 PMQuoteAnother side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.
Nope, the timings are exactly the same if a match is found. If no match is found and the SEH is triggered, then you have a design problem.
You misinterpreted my comment, I said when an exception is triggered. That comes from the transitions from user-land to ring 0 and vice-versa.
The whole thing is not about SEH, it is about code quality degradation. :nie:
It is expected that when a proc that works perfectly fine is replaced by a better/faster version that it returns exactly the same results.
It makes no sense to replace it with a faster version that may work 99.9% but fails under some conditions the user has no control over. :eusa_naughty:
For sure there are cases where a wrong result can be tolerated, but imagine you use this code in a critical process and this super fast routine fails 1/1000 times crashing a machine/car/rocket.
No one in their right mind would use it as a general purpose routine for a few microseconds that don't matter in the end.
I don't want to hurt anyone's feelings, but wouldn't it be better to find a better and suitable solution? :wink2:
Biterider
Hi Biterider, it seems I edited your post and made a mess out of it - apologies.
QuoteIt is obviously wrong to always return 0 when you have a match, which leads to wrong results.
Describe a situation where an exception was triggered but there still was a match.
QuoteYou are talking about a TL-handler that has to deal with all kinds of exceptions coming from somewhere in your application.
This means that you have to analyse exactly where the exception occurred and what went wrong, and then unwind the whole thing to the offending procedure, which is a leaf procedure.
You didn't check the attachment posted above.
QuoteQuote from: jj2007 on August 06, 2023, 11:27:25 PMQuoteAnother side effect is that thousands and thousands of CPU cycles are wasted when triggering the SEH mechanism.
Nope, the timings are exactly the same if a match is found. If no match is found and the SEH is triggered, then you have a design problem.
You misinterpreted my comment, I said when an exception is triggered. That comes from the transitions from user-land to ring 0 and vice-versa.
Yes, we are wasting a millisecond or so in the extremely unlikely event of an exception.
Quote from: jj2007 on August 07, 2023, 06:11:13 AMDescribe a situation where an exception was triggered but there still was a match.
Suppose you have an ANSI zero terminated string "abcABC" at the end of a page and the next page is not commited.
Now you start searching for "ABC" at a 16 byte unaligned address. You overshoot the page boundary by 9 bytes raising an exception and you should return a match result.
? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? |
? | ? | ? | ? | ? | ? | ? | ? | ? | a | b | c | A | B | C | 0 |
X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X |
Biterider
Well, you have a point there :thumbsup:
It's not so relevant for MasmBasic, because strings and buffers are all HeapAlloc'ed with a little reserve of 24 bytes that helps to avoid this scenario.
See Peter Cordes' very detailed description of pcmpistri & pcmpestri performance at SOF (https://stackoverflow.com/questions/58901232/why-is-sse4-2-cmpstr-slower-than-regular-code)
I have a version that checks the bounds, and it's only a tick slower than the fast version, but I really don't think it's worth the effort. Real life strings and buffers should never end at no man's land.
Btw this logic should be applied to all code. Attached an example that shows the behaviour of the Masm32 SDK len() macro. You might be surprised :cool:
Hi, JJ, i´m trying to create a benchmark test on a new BinarySearch/Instring function i did (specifically with SSE2 only), but i couldn´t be able to implement it on your benchmark app to test it. Can, u please input it on your bench algo so we can test the speed of it ?
It is designed to work with any pattern size. So, it can handle strings (or data chunks) from 1 to XXX bytes. Since i focused on speed, i didn´t implemented on the function an error check in case the pattern is smaller than the dataset or check if the size of the data/pattern are zeroed etc. The size of the dataset must be bigger or equal to 16 bytes long
I attached the masm compatible source code (the same syntax i used when testing on your benchmark app).
If you need explanations on usage here is The RosAsm code (With explanations of how the function works):
Main function SIMD_BinaryScan
;;
SIMD_BinaryScan
This function performs a binary search for a chunk of data (pSearchPattern) within another larger dataset (pDataBuffer).
It uses SSE2 instructions for optimization and speed, especially with large data sets.
The function can be used to find data from any size, including strings (case sensitive) on a big text
Parameters:
pDataBuffer (in): Pointer to the start of the larger dataset in which the search is performed.
This is the buffer in which you are searching.
DataSize (in): The size (in bytes) of the Data.
pSearchPattern (in): Pointer to the start of the Search Pattern (the data chunk to search for within the DataSet).
This is the smaller data chunk you are searching for.
PatternSize (in): The size (in bytes) of the SearchPattern.
Return Value:
Success: If the Pattern is found, the function returns a pointer to the starting position of the matching chunk in the Dataset.
Failure: Returns 0 (&NULL) if the Pattern is not found in the Dataset.
Remarks:
Main Function Flow:
1 - Initialize First and Last Byte of the Pattern:
The first and last bytes of the Pattern are expanded into SSE registers xmm2 and xmm3 for comparison against blocks of the DataSet.
2 - Loop through Dataset Blocks:
Each block of 16 bytes from the Dataset is loaded into xmm0 and xmm1 to compare the first and last bytes with the corresponding bytes of the pattern.
SSE2 instructions (pcmpeqb, pand, pmovmskb) are used to generate masks and compare these blocks efficiently.
3 - Check for Match:
If matching bits are found, the position of the first matching byte is retrieved, and a detailed byte-by-byte comparison is done with
SIMD_ComparePattern to confirm the match.
If no match is found, the loop continues with the next block of data in the Dataset.
4 - Handle Remainder:
If the Dataset's size is not a multiple of 16, any remaining bytes are handled separately at the end of the Dataset.
5 - Return Result:
If the pattern is found, the function returns the position of the match; otherwise, it returns 0 (no match).
Example of usage:
[haystack: B$ "This is a simples string to search within. While trying to find sinp from sinples string", 0]
[needle: B$ "search wi", 0]
call 'FastCRT.StrLen' haystack
mov ebx eax
call 'FastCRT.StrLen' needle
call SIMD_BinaryScan haystack, ebx, needle, eax
;;
Proc SIMD_BinaryScan:
Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
Local @BitPos, @StringEnd, @MaxPos
Structure @XmmPreserve 80, @XMMReg0Dis 0, @XMMReg1Dis 16, @XMMReg2Dis 32, @XMMReg3Dis 48, @XMMReg4Dis 64
Uses esi, ebx, edx, edi
; save the contents xmm registers to avoid altering them
lea eax D@XMMReg0Dis | movdqu X$eax XMM0
lea eax D@XMMReg1Dis | movdqu X$eax XMM1
lea eax D@XMMReg2Dis | movdqu X$eax XMM2
lea eax D@XMMReg3Dis | movdqu X$eax XMM3
; 1st get pointer to the string/data to be found
mov esi D@pSearchPattern
; get the 1st char of the
movzx eax B$esi | SSE2_EXPAND_BYTE2 xmm2 eax ; xmm2 = extended 1st byte of the string xmm2 = Var_First
mov eax esi | add eax D@PatternSize
movzx eax B$eax-1 | SSE2_EXPAND_BYTE2 xmm3 eax ; xmm3 = extended last byte of the string xmm3 = Var_Last
mov esi D@pDataBuffer | mov eax esi | add eax D@DataSize | sub eax 16 | mov D@StringEnd eax
..Do
movups XMM0 X$esi; xmm0 = Var_BlockFirst.
mov eax D@PatternSize
movups XMM1 X$esi+eax-1; xmm1 = Var_BlockLast.
; we need to create a mask formed by the comparition of the last bytes.
; Var_EqFirst = comparition between Var_First and Var_BlockFirst. The result is stored in xmm0
pcmpeqb xmm0 xmm2 ; compare Var_First with previous Var_BlockFirst = Var_EqFirst
; Var_EqLast = comparition between Var_Last and Var_BlockLast. The result is stored in xmm1
pcmpeqb xmm1 xmm3 ; compare Var_Last with previous Var_BlockLast = Var_EqLast
pand xmm1 xmm0 ; Now we simply need to and those masks (Var_EqFirst and Var_EqLast). We store in xmm1 to preserve xmm0 to reuse below
pmovmskb ebx XMM1 ; And finally create that mask and copy it onto eax. Maximum value = 0FFFF
.Test_If_Not_Zero ebx
Do
; get first bit_set
GET_FIRST_BIT_SET ebx | mov D@BitPos eax
lea edi D$esi+eax
call SIMD_ComparePattern edi, D@pSearchPattern, D@PatternSize
cmp eax 0 | jnz L2>>
;bsr ax bx | btr bx ax | jc L1< ; in practise, the carry flag will be settled only when ebx (bx) is not zero. So, the loop will works whenever ebx <> 0
; clear leftmost bit_set
CLEAR_LEFTMOST_SET16 bx
Repeat_Until_Zero
; now we store the last known bitpos where the string was not matched,
; and add it to the string size to get the next position to search
mov eax D@BitPos | add eax D@PatternSize
.Test_Else
pmovmskb eax xmm0 ; And finally create that mask and copy it onto eax. Maximum value = 0FFFF
Test_If_Not_Zero eax
; get last bit_set
GET_LAST_BIT_SET eax
inc eax ; The last pair of bits didn´ matched. Therefore, we need to increase only by 1 to do to the next position
Test_Else
mov eax 16
Test_End
.Test_End
add esi eax
..Loop_Until esi > D@StringEnd
; Now handle remainder
.If D@PatternSize <= 16
mov esi D@StringEnd
movups XMM0 X$esi; xmm0 = Var_BlockFirst.
; we need to create a mask formed by the comparition of the last bytes.
; Var_EqFirst = comparition between Var_First and Var_BlockFirst. The result is stored in xmm0
pcmpeqb xmm0 xmm2; compare Var_First with previous Var_BlockFirst = Var_EqFirst
pmovmskb ebx XMM0 ; Create that mask and copy it onto eax. Maximum value = 0FFFF
.Test_If_Not_Zero ebx
mov eax 16 | sub eax D@PatternSize | mov D@MaxPos eax
Do
; get first bit_set
GET_FIRST_BIT_SET ebx; | mov D@BitPos eax
cmp eax D@MaxPos | jnbe L1> ; exit if data is not below or equal (So, jump if bigger than ...)
lea edi D$esi+eax
call SIMD_ComparePattern edi, D@pSearchPattern, D@PatternSize
cmp eax 0 | jnz L2>
; clear leftmost bit_set
CLEAR_LEFTMOST_SET16 bx
Repeat_Until_Zero
.Test_End
.End_If
L1:
xor eax eax
L2:
lea esi D@XMMReg0Dis | movdqu XMM0 X$esi
lea edi D@XMMReg1Dis | movdqu XMM1 X$edi
lea esi D@XMMReg2Dis | movdqu XMM2 X$esi
lea edi D@XMMReg3Dis | movdqu XMM3 X$edi
EndP
Internal byte compare function
;;
SIMD_ComparePattern
This function, performs a detailed byte-by-byte comparison of two buffers. It is called once a potential
match is found by the SSE2 block comparison.
Parameters:
pBuffer1 (in): Pointer to the block in the haystack (loaded in esi).
pBuffer2 (in): Pointer to the needle (loaded in edi).
length (in): Length (in bytes) of the buffers to be compared.
Return Value:
Success: Returns the pointer to pBuffer1 if the contents match exactly.
Failure: Returns 0 if the contents do not match.
Remarks:
This function is an adaptation of memcmp_SSE2, but to be used only inside SIMD_BinaryScan.
It is also developed for speed, for this reason it don't preserve the registers (except esi)
that are being used in SIMD_BinaryScan.
;;
Proc SIMD_ComparePattern:
Arguments @pBuffer1, @pBuffer2, @Lenght
Local @Reminder
Uses esi; No need to preserve edi and others registers too since this function is used only inside SIMD_BinaryScan.
; Step1 - Check if the lenght is zeroed. If lenght is 0, jump over the whole function
mov edx D@Lenght
mov esi D@pBuffer1
mov edi D@pBuffer2
and edx 0-16 | mov eax edx | xor edx D@Lenght ; When 0 means lenght is divisible by 16 and we have no remainders, jmp over to the main function
; Step2
.Test_If_Not_Zero eax ; edx = 0 ? no remainders exit
mov D@Reminder edx
; We are working with multiples of 16 bytes, therefore, we can get rid of using shr eax 4 and directly subtract 16 bytes on each loop
; On this way we get a bit more of speed while keeping performance. Interesting is that add eax 0-16 seems a bit faster then sub eax 16
.Do
movdqu xmm0 X$esi
movdqu xmm1 X$edi
pcmpeqd xmm0 xmm1
movmskps edx xmm0
If edx <> 0F
xor eax eax | jmp L4>>
End_If
add esi 16
add edi 16
add eax 0-16 ; subtract by 16 bytes. Faster then sub
.Repeat_Until_Zero
mov edx D@Reminder
.Test_End
.Test_If_Not_Zero edx
; Now becomes the trick. Here we will calculate the remainders of our previous routine stored in edx
; So, assuming our initial value was 43, we now have only 11 as remainder, which is 2 Dwords (8 bytes) + 3 bytes
; So we have 2 main blocks to check. One containing only the dwords checking. So, a value whose range is 0 to 12 (multiple of 4 only, so, 0, 4, 8, 12
; and other related to the final bytes (3), whose values are a maximum of 3, so 0, 1, 2, 3
; We will store the Dwords computation in dl and the Byte computation in dh
; This way is faster then we use shr eax 2 or something to divide the dword by 4 and later compute the other byte remainders
mov dh dl
and dl 0-4 ; dl = our Dword computation. So, values multiples of 4 only, on a maximum of 12
xor dh dl ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
Test_If_Not_Zero dl
; If dl = 0, so if we have no dwords remainder, jmp over to compute the byte remainders
mov eax D$esi | cmp eax D$edi | jz L1> | xor eax eax | jmp L4> | L1: | add esi 4 | add edi 4 | add dl 0-4 | jz L2>
mov eax D$esi | cmp eax D$edi | jz L1> | xor eax eax | jmp L4> | L1: | add esi 4 | add edi 4 | add dl 0-4 | jz L2>
mov eax D$esi | cmp eax D$edi | jz L1> | xor eax eax | jmp L4> | L1: | add esi 4 | add edi 4
Test_End
L2:
; check if we have no byte remainders
Test_If_Not_Zero dh ; dh = our byte computation. So, values are multples of 1 only, on a maximum of 3
mov eax 0 ; surprisinly a bit faster then xor eax eax
mov dl B$esi | cmp dl B$edi | jnz L4> | dec dh | jz L2>
mov dl B$esi+1 | cmp dl B$edi+1 | jnz L4> | dec dh | jz L2>
mov dl B$esi+2 | cmp dl B$edi+2 | jnz L4>
L2:
Test_End
.Test_End
mov eax D@pBuffer1
L4:
EndP
Macros used:
;;
SSE2_EXPAND_BYTE and SSE2_EXPAND_BYTE_FAST
These macros replicates a single byte from a 32-bit register across all 16 bytes (128 bits) of an XMM register.
They are similar to _mm_set1_epi8() in SSE2.
SSE2_EXPAND_BYTE Uses only SSE2 registers. So it preserves normal x86 registers while working.
SSE2_EXPAND_BYTE_FAST performs a similar task but uses an alternate method to replicate the byte by multiplying it by 0x01010101
before moving it to the XMM register. The only difference is that it changes the value of edx. So, if you intend
to use it, make sure to preserve edx before using it, in case you need edx in other parts of your code.
SSE2_EXPAND_BYTE
Explanation and Usage:
movd #1, #2: Move the 32-bit value in the normal register (#2) into the lower 32 bits of the XMM register (#1).
punpcklbw #1, #1: Unpack the bytes in the lower part of the XMM register into words, filling all 16 bits with the original byte.
punpcklwd #1, #1: Unpack the words into double words, expanding the byte across more space.
pshufd #1, #1, 0: Shuffle the four double words to fill all 16 bytes of the XMM register with the same byte.
Example of usage:
SSE2_EXPAND_BYTE xmm0 eax ; expands the byte in eax to fill xmm0.
SSE2_EXPAND_BYTE_FAST
Explanation and Usage:
imul #2, #2, 01010101: Multiply the value in the 32-bit register by 0x01010101, replicating the byte across the register.
movd #1, #2: Move the replicated byte from the normal register (#2) to the XMM register (#1).
pshufd #1, #1, SSE_FILL_DWORDS: Shuffle the double words to fill all 16 bytes in the XMM register with the byte.
[SSE_FILL_DWORDS 0] ; equate used for shuffling XMM registers to fill double words (pshufd shuffle mask).
;;
[SSE_FILL_DWORDS 0]
[SSE2_EXPAND_BYTE | movd #1 #2 | punpcklbw #1 #1 | punpcklwd #1 #1 | pshufd #1 #1 0]
[SSE2_EXPAND_BYTE2 | imul #2 #2 01010101 | movd #1 #2 | pshufd #1 #1 SSE_FILL_DWORDS]
;;
CLEAR_LEFTMOST_SET16
This macro clears the left most distance bit on a 16 bit data.
Parameters
#1 (in/out)- A word size data or registers. Since the macro uses ax to compute, don´t use ax registers as input.
Return value:
The macro will return in the input the converted 16bit value. Also it will set the zero flag if it results 0
Remarks:
The macro uses ax register internaly. So on input you cannot use ax or you will get a incorrect value
Explanation:
mov ax, #1 – Move the value of the 16-bit register (like bx, cx, etc.) into ax.
add #1, 0-1 – Subtract 1 from the value of the 16-bit register.
and #1, ax – Perform a bitwise AND between the original value and the decremented value (after subtracting 1).
Example of usage:
CLEAR_LEFTMOST_SET16 bx
Will unroll to: mov ax bx | add bx 0-1 | and bx ax
;;
[CLEAR_LEFTMOST_SET16 | mov ax #1 | add #1 0-1 | and #1 ax]
;;
GET_FIRST_BIT_SET
This macro locates the position of the first (lowest) set bit in the given register.
Macro Code:
[GET_FIRST_BIT_SET | bsf eax, #1]
Explanation and Usage:
bsf eax, #1: Bit Scan Forward—find the position of the first set bit in #1 and store the result in eax.
Finds the position of the first set bit (least significant bit) in the given register.
Example: GET_FIRST_BIT_SET ebx ; stores the position of the first set bit in ebx into eax.
;;
[GET_FIRST_BIT_SET | bsf eax #1]
;;
GET_LAST_BIT_SET
This macro locates the position of the last (highest) set bit in the given register.
Macro Code:
[GET_LAST_BIT_SET | bsr eax, #1]
Explanation and Usage:
bsr eax, #1: Bit Scan Reverse—find the position of the last set bit in #1 and store the result in eax.
Finds the position of the last set bit (most significant bit) in the given register.
Example: GET_LAST_BIT_SET ebx ; stores the position of the last set bit in ebx into eax.
;;
[GET_LAST_BIT_SET | bsr eax #1]
Hi Guga,
Attached the source. There is a problem with the parameters, though. Are you sure the sub esp, 4 is correct?
Jochen
Hi JJ
tks you a lot.
Yes, i´m sure.
The problem you found is because you called the internal function and not the main function.
The main function to use is: SIMD_BinaryScan (Uses 4 parameters)
While SIMD_ComparePattern (Uses 3 parameters) is used only inside the SIMD_BinaryScan without preserving any registers (That are reused by the main function) and cannot be used alone (outside that function). The SIMD_ComparePattern function is a adaptation i made for the previous memcmp_SSE2 i posted here https://masm32.com/board/index.php?topic=12256.msg133369#msg133369
Replace this:
NameA equ Guga compare pattern ; assign a descriptive name for each test
TestA proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
push 4
push chr$("zero")
push 100
push offset somestring
call SIMD_ComparePattern
dec ebx
.Until Sign?
movd xmm0, eax
ret
with this:
NameA equ Guga compare pattern ; assign a descriptive name for each test
TestA proc
mov ebx, AlgoLoops-1 ; loop e.g. 100x
align 4
.Repeat
push 4
push chr$("zero")
push 100
push offset somestring
call SIMD_BinaryScan
dec ebx
.Until Sign?
movd xmm0, eax
ret
here are my results:
AMD Ryzen 5 2400G with Radeon Vega Graphics (SSE4)
5658 cycles for 100 * Guga compare pattern
8070 cycles for 100 * Masm32 len
5613 cycles for 100 * Guga compare pattern
8264 cycles for 100 * Masm32 len
5630 cycles for 100 * Guga compare pattern
8001 cycles for 100 * Masm32 len
5585 cycles for 100 * Guga compare pattern
8118 cycles for 100 * Masm32 len
5765 cycles for 100 * Guga compare pattern
8018 cycles for 100 * Masm32 len
535 bytes for Guga compare pattern
16 bytes for Masm32 len
4202622 = eax Guga compare pattern
100 = eax Masm32 len
--- ok ---
Btw, on your code, this:
push chr$("zero")
is the same as
push offset Somestring2 ; where SomeString2 = db "zero", 0 right ?
Can u please try also with a longer pattern? I mean, longer than 16 bytes and somestring be around 50 Kb (or something). I would like to see how the function behave on a larger dataset and a larger pattern
Btw, if the speed is Ok and we have no more room for optimization (on SSE2), i´ll try to make another version that work backwards (from bottom to top). Those routines can be handy for a faster string and replace algorithm (Those that includes the direction to search, match cases, whole words, ignore chars etc)
Note:
I know that we could speed it a little bit in SIMD_BinaryScan making the 1st loop advance 1 byte and decreasing the pattern size by 2, but if i do this, the app won´t work if the pattern is only 1 byte, and in order to fix it i´ll have to do a work around specifically for patterns of only 1 char, which, at the end will make useless trying to gain more 8 to 10% of speed (because it will slow again, if i had to make a workaround). Example,on the 1st loop (when the pattern is bigger or equal to 16 bytes, we could simply do this:
mov eax esi | add eax D@PatternSize
movzx eax B$eax-1 | SSE2_EXPAND_BYTE_FAST xmm3 eax ; xmm3 = extended last byte of the string xmm3 = Var_Last
mov esi D@pDataBuffer | mov eax esi | add eax D@DataSize | sub eax 16 | mov D@StringEnd eax
..Do
(....)
; get first bit_set
GET_FIRST_BIT_SET ebx | mov D@BitPos eax
lea edi D$esi+eax+1
call SIMD_ComparePattern edi, D@pSearchPattern2, D@PatternSize2
; where D@pSearchPattern2 = D@pSearchPattern + 1 byte. SO, points to the 2nd byte in pSearchPattern
; where D@PatternSize2 = D@PatternSize - 2 bytes
The above piece of code i tested before, and indeed i gained around 8 to 10% of speed, because it will compare a smaller pattern (It will compare the pattern, except the 1st and last byte, since it already matched). The problem is when the pattern is only 1 byte long (single char). So, i choose to keep the comparison on the full pattern, and remove all register preservations inside SIMD_ComparePattern (except for esi).Removing the preservation of registers (normal x86 and xmm) improved the speed, even though it could still be a bit more faster if i compared with a smaller pattern.