SIMD_BinaryScan (Instring equivalent)

guga · March 24, 2025, 05:08:06 PM

Hi Guys

I updated the algorithm in the 1st post. I also added comments and the corresponding MASM version to be tested (just replace xmmword with oword, or the equivalent for SSE2 in MASM).

guga · March 25, 2025, 09:26:39 AM

As reported in the 1st post, if anyone else wants to give this algo a try, or improve it for SSE2 while obeying the limits of an xmm register (if possible), here are a couple of things that could be done:

1 - Create a flag before the loop for the case where the pattern is only 2 bytes. True = the matching pair is already the whole match. False = not yet, the pattern is bigger than 2 bytes. This flag can be checked right after the .Test_If_Not_Zero edx, something like:

(...)
.Test_If_Not_Zero edx
Do
; Find the first set bit in the bitmask.
GET_FIRST_BIT_SET edx | mov D@FirstBitPos eax
cmp D@PatternSize 2 | jz L2>> ; Since the pattern is only 2 bytes and we already found them in the SIMD computation, there's no need to go further into the calls to the helper functions.

2 - Make use of xmm4 and xmm5 to store the 2nd and penultimate bytes of the pattern, each one broadcast to all 16 bytes of its register. Then, on the SIMD comparison, use registers xmm6 and xmm7 to load the corresponding data blocks and compare them as we did for xmm0 and xmm1. This way we compare 4 bytes of the pattern at once instead of 2. Then simply add another pand between the resulting pairs and create the mask in edx as we did with pmovmskb edx XMM1. The problem is that, if you use the 2nd and penultimate bytes, you also need to adjust the pointers based on esi so they point to the proper places, something like:

movdqu XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
movdqu XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
movdqu XMM6 X$esi+1 ; Load the 16 bytes starting at the 2nd byte of the current block into xmm6. Var_BlockSecond.
movdqu XMM7 X$esi+ebx-2 ; Load the 16 bytes starting at the penultimate byte of the current block into xmm7. Var_BlockPenultimate.

The main problem will be obeying the boundaries, so the loads never try to point to addresses outside the limits of the data. So you will most likely need another x86 register to adjust those pointers dynamically. Something like:

movdqu XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
movdqu XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
mov eax esi | add eax D@SecondByteFlag
movdqu XMM6 X$eax ; Load the 16 bytes starting at the 2nd byte of the current block into xmm6. Var_BlockSecond.
mov eax esi | sub eax D@PenultimateByteFlag
movdqu XMM7 X$eax+ebx ; Load the 16 bytes starting at the penultimate byte of the current block into xmm7. Var_BlockPenultimate.

In this example, those flags represent the number of bytes to advance or subtract according to the size of the pattern. They must be computed before the loop as well. Something like:

If pattern = 2 bytes, D@SecondByteFlag = 0, D@PenultimateByteFlag = 1 ; Always starts with a subtraction of 1, so the two extra loads just repeat the first and last bytes.
If pattern = 3 bytes, D@SecondByteFlag = 1, D@PenultimateByteFlag = 1 ; The 2nd and penultimate bytes are the same byte, so we only need to point the penultimate load to the last byte (-1 as above).
If pattern >= 4 bytes, D@SecondByteFlag = 1, D@PenultimateByteFlag = 2 ; Pattern is 4 bytes or bigger (odd or even, it won't matter), so we compare the 2 bytes at the start of the pattern and the 2 bytes at the end, regardless of the size of the pattern.

In theory, this could work and let the algorithm advance 16 bytes more often, making it a bit faster, since the chances of 4 matching bytes at those same positions are low in most real-life applications (especially if the data set is composed of random bytes).
The only problem I see in adding those 2 techniques is whether they will really increase the speed, or whether we just added overhead to the algo, making it slower.
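
For anyone who prefers to experiment outside RosAsm, here is a minimal C sketch of the 4-probe idea above, using SSE2 intrinsics. It is not the SIMD_BinaryScan code: the names scan4 and first_bit, the scalar tail loop and the memcmp check are all assumptions made for illustration, and it assumes an x86 target with SSE2. The second/penult offsets follow the flag table above.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <string.h>

/* Index of the lowest set bit of a non-zero mask (hypothetical helper). */
static unsigned first_bit(unsigned mask)
{
    unsigned bit = 0;
    while (!(mask & 1u)) { mask >>= 1; ++bit; }
    return bit;
}

/* Hypothetical sketch: offset of the first match of pat[0..m) in data[0..n), or -1. */
static long scan4(const unsigned char *data, size_t n,
                  const unsigned char *pat, size_t m)
{
    if (m < 2 || n < m) return -1;

    /* Probe offsets: m = 2 -> 0 and 1, m = 3 -> 1 and 1, m >= 4 -> 1 and 2. */
    size_t second = (m == 2) ? 0 : 1;
    size_t penult = (m >= 4) ? 2 : 1;

    /* Broadcast the four probe bytes of the pattern. */
    __m128i f = _mm_set1_epi8((char)pat[0]);
    __m128i l = _mm_set1_epi8((char)pat[m - 1]);
    __m128i s = _mm_set1_epi8((char)pat[second]);
    __m128i p = _mm_set1_epi8((char)pat[m - penult]);

    size_t i = 0;
    /* The furthest byte read is data[i + m - 1 + 15], so it must be below n. */
    while (i + m - 1 + 16 <= n) {
        __m128i bf = _mm_loadu_si128((const __m128i *)(data + i));
        __m128i bl = _mm_loadu_si128((const __m128i *)(data + i + m - 1));
        __m128i bs = _mm_loadu_si128((const __m128i *)(data + i + second));
        __m128i bp = _mm_loadu_si128((const __m128i *)(data + i + m - penult));
        __m128i eq = _mm_and_si128(
            _mm_and_si128(_mm_cmpeq_epi8(bf, f), _mm_cmpeq_epi8(bl, l)),
            _mm_and_si128(_mm_cmpeq_epi8(bs, s), _mm_cmpeq_epi8(bp, p)));
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);
        while (mask) {
            size_t pos = i + first_bit(mask);
            /* Up to 4 bytes the probes already cover the whole pattern. */
            if (m <= 4 || memcmp(data + pos + 2, pat + 2, m - 4) == 0)
                return (long)pos;
            mask &= mask - 1;  /* clear the lowest set bit, try the next candidate */
        }
        i += 16;
    }
    /* Scalar tail: positions too close to the end for a 16-byte load. */
    for (; i + m <= n; ++i)
        if (memcmp(data + i, pat, m) == 0) return (long)i;
    return -1;
}
```

Timing scan4 against a 2-probe variant (drop the bs/bp loads and the second pand) on the same data would be one way to check whether the extra loads pay for themselves.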

zedd · March 26, 2025, 03:23:44 PM

Results from running executable in "BinScan24Mar2025.zip" attached as a .txt file (not zipped)

guga · March 26, 2025, 03:50:09 PM

Tks a lot, Zedd

It seems to be working as expected now. Faster and handling all the data. I'm currently working on the reversed function. I thought it would be easier to develop the exact mirror of SIMD_BinaryScan, since I succeeded in fixing the original version, but it's a bit hard to make it stay within the limits when working backwards. I'm trying to see why this is happening, and when I succeed, I'll also post the backwards function here.

Those functions to properly scan data (forward or backward) are the ones I need to fix soon, because I'm trying to use them not only inside RosAsm but for general usage as well. I'm currently trying to fix several old functions in RosAsm to release a newer version (if I succeed in making the functions work more independently from each other and in creating the proper DLLs for that). I need to fix RosAsm and make it easier to handle the disassembler, debugger, decoder and encoder, and also create the proper internal tables to handle the necessary data. This way it will be easier if someday, eventually, I decide to create a 64-bit version of RosAsm as well. Personally, I'm not a big fan of 64-bit apps, but they are in fact needed nowadays, especially when developing plugins or small apps for image, audio and video processing. But a 64-bit version of RosAsm is impossible to port from the current source code. That's why I'm trying to fix it.

guga · April 02, 2025, 12:44:27 AM

Hi guys, the reverse function is almost finished. I succeeded in making the reverse function (scan from end to start) run at a speed similar to the normal version. I found only 1 mistake in my recent tests, and it seems to happen on aligned strings (aligned to 16 bytes). I'm pretty sure it is missing only 1 byte somewhere; I'll try to fix it before the next release. Also, I updated the code of the normal function again (I focused more on avoiding reads beyond the limits of the data, but the speed is the same as the last one, or faster in some cases). Once I finish, I'll post it here.

For the next steps of those algorithms I'll focus on strings:
1 - Create the corresponding search for strings in case-insensitive mode, also covering special chars of Latin origin (such as Portuguese, French, German, Italian, Spanish, Swiss and some Nordic/Scandinavian languages: Danish, Norwegian, Swedish)
2 - Create a routine to skip some user-defined chars
3 - Create a routine to check the boundaries of a string (word edge). In fact, the word edge is already implemented, but I need to adapt the updated version I made for RosAsm to these new string search routines
4 - Unicode version (UTF16), but the simpler one... the one that uses only the char followed by 0, such as in Unicode C, Pascal style, Delphi style etc. (Sorry, no plans to work on Chinese, Japanese, Russian or other fancy Unicode chars so soon)

Note: For fancy languages (Russian, Chinese, Japanese etc.) that use UTF16 or even UTF32, the normal algorithms (SIMD_BinaryScan and its reversal) should work already, but they won't work for case insensitive yet, because I have no idea if (or how) such languages deal with case-insensitive strings.

References for the Unicode versions to analyze later:
What is unicode ?
The ISO latin table
UTF8 encoding
UTF16

guga · April 02, 2025, 12:03:32 PM

Finished the backwards algo. I'll clean up the code, put the comments in it and post both here (the updated normal version and this one). I tested both and they seem to work without errors (as far as I saw). I loaded a big text (around 6.5 MB) from Project Gutenberg called "The Project Gutenberg EBook of The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (#15 in our series by Sir Arthur Conan Doyle)" and started scanning it. The test consists of loading the text into memory and creating patterns whose start position is incremented by 1 byte on each loop. So, on the 1st iteration, it takes 38 bytes from the start, uses them as a pattern and scans for it in the big text. On the next iteration, it adds 1 byte to the previous position, then searches for the 38 bytes from there as a pattern in the big text, and it continues incrementing and scanning until it reaches the end of the 6.5 MB file. I limited the size of the pattern to 38 bytes in order to see if the algo would crash, considering that the pattern position is incremented on each loop and therefore the scans run on both unaligned and aligned text, depending on the position of the newly created pattern. Fortunately, it always found the patterns created and never went past the limits of the SSE registers in any of its internal functions.

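So, as an upper bound, it processed something around 6.5 MB * 6.5 million scans (about 42.25 TB) of information. In my tests it took a bit more than one hour to perform the whole thing. So, not counting the limitation of my CPU, the speed of this thing (in theory) can reach something around 11.74 GB/sec. Each scan stops at the first match it finds, so the real amount of data read, and the real speed, is lower than this estimate.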
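
The estimate can be redone with a couple of lines of C. This is only a sketch of the arithmetic: the helper names are made up, 1 GB is taken as 1e9 bytes, and the run time is assumed to be exactly one hour.

```c
/* Upper bound on the bytes scanned: one full pass over the text
   for every pattern start position (hypothetical helper). */
static double upper_bound_bytes(double text_bytes)
{
    return text_bytes * text_bytes;
}

/* Throughput in GB/s (1 GB = 1e9 bytes) for `bytes` processed in `seconds`. */
static double gb_per_sec(double bytes, double seconds)
{
    return bytes / 1e9 / seconds;
}
```

With text_bytes = 6.5e6 and seconds = 3600, upper_bound_bytes gives 42.25e12 bytes (42.25 TB) and gb_per_sec gives about 11.74. Since a scan returns at the first match it reaches, the bytes actually read are fewer than this bound, so the figure is better treated as a ceiling.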

The test I did was like this:

[MyTxt: D$ 0] ; Pointer to the contents of Big.txt
[MyTxtSize: D$ 0] ; Size of Big.txt
[GugaDymmyCnt: D$ 0]

    call 'FastCRT.ReadOpenedFile' &NULL, {B$ "big.txt", 0}, MyTxt, MyTxtSize, 0
    mov eax D$MyTxtSize | mov D$GugaDymmyCnt eax
    mov edi D$MyTxt
    Do
        call SIMD_BinaryScan_Back D$MyTxt, D$MyTxtSize, edi, 38
        If eax = 0
            mov eax eax ; If the debugger pauses here, it means we found an error. Fortunately, it never reached this point, meaning it always found the patterns it was searching for :)
        End_If
        inc edi
        dec D$GugaDymmyCnt ; dummy counter to estimate when this thing will stop.
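        lea eax D$edi+38 ; eax = address of the byte right after the next 38-byte pattern.
    Loop_Until B$eax = 0 ; Stop when that byte is the zero terminator of the text.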
    call 'KERNEL32.ExitProcess' 0
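
The same sliding-pattern test can be sketched in C for cross-checking. Note the assumptions: scan_back below is a naive stand-in written only for this example, not SIMD_BinaryScan_Back, and it is assumed that the backward scan returns the address of the last match (or NULL). The name self_test is hypothetical too.

```c
#include <stddef.h>
#include <string.h>

/* Naive stand-in for a backward scan: address of the last match, or NULL. */
static const unsigned char *scan_back(const unsigned char *text, size_t n,
                                      const unsigned char *pat, size_t m)
{
    if (m == 0 || n < m) return NULL;
    for (size_t i = n - m + 1; i-- > 0; )      /* i runs from n - m down to 0 */
        if (memcmp(text + i, pat, m) == 0) return text + i;
    return NULL;
}

/* Uses every m-byte window of the text as a pattern and counts the failures. */
static size_t self_test(const unsigned char *text, size_t n, size_t m)
{
    size_t failures = 0;
    for (size_t start = 0; start + m <= n; ++start) {
        const unsigned char *hit = scan_back(text, n, text + start, m);
        /* The window itself is a match, so a backward scan must report a
           position at or after `start` whose bytes equal the window. */
        if (hit == NULL || hit < text + start ||
            memcmp(hit, text + start, m) != 0)
            ++failures;
    }
    return failures;
}
```

Swapping scan_back for a wrapper around the real routine and requiring self_test to return 0 would reproduce the check above, with the reference result available for comparison.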

Once I finish the comments I'll upload it here.

One question. The reversed function takes as input the starting address of the text (the same as the regular version) and internally moves to the end so it can start scanning backwards. I made it this way only to maintain some correspondence with the normal version. But I'm considering adding a flag where the user can choose whether to scan from the end of the string to the start.

For example:

[MyTxt: B$ "I say, Berg, my dear fellow," said Rostov, "when you get a letter
from home and meet one of your own people whom you want to talk
everything over with, and I happen to be there, I'll go at once, to be
out of your way! Do go somewhere, anywhere... to the devil!" he
exclaimed, and immediately seizing him by the shoulder and looking
amiably into his face, evidently wishing to soften the rudeness of his
words, he added, "Don't be hurt, my dear fellow; you know I speak from
my heart as to an old acquaintance.", 0]

[MyTxtSize: D$ 509]

[MyPattern: B$ "wishing", 0]
[MYpatternSize: D$ 7]

[SCAN_FROM_END 0]
[SCAN_FROM_START 1]

call SIMD_BinaryScan_Back MyTxt, D$MyTxtSize, MyPattern, D$MYpatternSize, SCAN_FROM_START

It will scan for the string "wishing" from the start of the text ("I say,...") to the end.

Or

call SIMD_BinaryScan_Back MyTxt, D$MyTxtSize, MyPattern, D$MYpatternSize, SCAN_FROM_END
It will scan for the string "wishing" from the end of the text ("an old acquaintance.") to the start.

Later I'll rename those flags; this was just an idea. Is a flag to do this better, or should I keep the algo without it, scanning by default always from start to the end?
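
To make the question concrete, here is a small C sketch of what a direction flag could look like from the caller's side. It is a naive byte-by-byte search, not the SIMD routine; scan_dir is a hypothetical name, and only the two flag values are taken from the equates above.

```c
#include <stddef.h>
#include <string.h>

enum { SCAN_FROM_END = 0, SCAN_FROM_START = 1 };

/* Offset of the first match met when walking in the requested direction,
   or -1 when the pattern does not occur (hypothetical sketch). */
static long scan_dir(const char *text, size_t n,
                     const char *pat, size_t m, int dir)
{
    if (m == 0 || n < m) return -1;
    size_t last = n - m;                 /* last offset where pat still fits */
    for (size_t k = 0; k <= last; ++k) {
        /* Same loop for both directions; only the visiting order changes. */
        size_t i = (dir == SCAN_FROM_START) ? k : last - k;
        if (memcmp(text + i, pat, m) == 0) return (long)i;
    }
    return -1;
}
```

One design point this shows: with a flag, the caller always passes the start address and the size, and the two directions only differ in which occurrence is reported when the pattern appears more than once.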

guga · April 04, 2025, 03:20:13 PM

Hi Guys

Finished the implementation of the new version that works with strings (case insensitive). So far, I'm working with Basic Latin (A to Z, a to z) so I can check the speed. It's a bit slower than I expected, because I needed to create another helper function to store small chunks (less than 16 bytes) in a variable. In fact, the function is called twice, since it needs to copy both the chunk related to the dataset and the other one related to the pattern.
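
For reference, one common way to fold Basic Latin case on 16 bytes at once with SSE2 is sketched below in C. This is an assumption about how such a comparison can be done, not the code inside the posted app; to_lower16 and equal_nocase16 are hypothetical names, and an x86 target with SSE2 is assumed.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Lowercases the bytes 'A'..'Z' in a 16-byte vector; every other byte,
   including values of 0x80 and above, is left unchanged. */
static __m128i to_lower16(__m128i v)
{
    /* SSE2 byte compares are signed, so bytes >= 0x80 look negative
       and fail the "greater than 'A' - 1" test, which is what we want. */
    __m128i ge_A = _mm_cmpgt_epi8(v, _mm_set1_epi8('A' - 1));
    __m128i le_Z = _mm_cmpgt_epi8(_mm_set1_epi8('Z' + 1), v);
    __m128i is_upper = _mm_and_si128(ge_A, le_Z);
    /* Add 0x20 only where the byte is an uppercase letter. */
    return _mm_add_epi8(v, _mm_and_si128(is_upper, _mm_set1_epi8(0x20)));
}

/* Returns 1 when the 16 bytes at a and b are equal ignoring A-Z/a-z case. */
static int equal_nocase16(const void *a, const void *b)
{
    __m128i va = to_lower16(_mm_loadu_si128((const __m128i *)a));
    __m128i vb = to_lower16(_mm_loadu_si128((const __m128i *)b));
    return _mm_movemask_epi8(_mm_cmpeq_epi8(va, vb)) == 0xFFFF;
}
```

Folding both sides with a range test, instead of simply OR-ing 0x20 into every byte, keeps pairs such as '[' and '{' or '@' and '`' from comparing equal.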

Can someone test to see if the results are ok? I mean, check if it really found the strings, or if it failed to find the pattern when it existed in the dataset?

Also, is there a faster way to copy a maximum of 16 bytes to a variable (unfortunately, it will need to be in its own function as well)? I was considering using rep movsb etc., but I think that can still be a bit slow, right?

The source is embedded in the app, btw. Later, if it really is working as expected, I'll post the source and the translation to MASM here (I'll put the other versions as well: the updated backwards and forwards ones I did previously).

If everything is ok, and if I succeed in gaining a bit more speed on this, I'll create another function adapted to work with all Latin strings (German, Portuguese, French, Italian, Spanish, and some Norwegian chars too) and later will do the same for the backwards version of those things.

The next step is adapting both functions to handle word edges (I already made it for RosAsm, but I'll need to adapt it to those functions), and also a routine to skip certain chars.

guga · April 04, 2025, 06:17:32 PM

Found out how to not ruin the performance. I made a simple version for copying small chunks of data:

Proc Memcpy_SmallFast:
    Arguments @pDest, @pSource, @Length
    Uses esi, edi, ecx

    mov edi D@pDest       ; Destination
    mov esi D@pSource     ; Source
    mov ecx D@Length      ; Size (max 16)

    If ecx >= 8
        movsd | movsd | sub ecx 8 | jz @DoneCopy
    End_If

    If ecx >= 4
        movsd | sub ecx 4 | jz @DoneCopy
    End_If
    rep movsb
@DoneCopy:

EndP
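
For comparison, the same 8 / 4 / byte-by-byte staging can be written in C roughly like this. It is only a sketch mirroring the steps of Memcpy_SmallFast: the name copy_small is hypothetical, and how the fixed-size memcpy calls get compiled depends on the compiler.

```c
#include <stddef.h>
#include <string.h>

/* Copies len bytes (intended for 0..16) in the same stages as the
   routine above: one 8-byte step, one 4-byte step, then single bytes. */
static void copy_small(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (len >= 8) {            /* like the movsd | movsd pair */
        memcpy(d, s, 8);
        d += 8; s += 8; len -= 8;
    }
    if (len >= 4) {            /* like the single movsd */
        memcpy(d, s, 4);
        d += 4; s += 4; len -= 4;
    }
    while (len--)              /* like rep movsb for the remainder */
        *d++ = *s++;
}
```

As in the assembly version, a 16-byte copy goes through an 8-byte step, a 4-byte step and then 4 single bytes.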

jj2007 · April 04, 2025, 06:43:18 PM

See Faster Memcopy in the Lab

guga · April 05, 2025, 06:05:08 PM

Tks, JJ

I had forgotten about that routine. I tested it but, since this is only for a maximum of 16 bytes, the normal x86 version (Memcpy_SmallFast) works better in the context of the whole function.

The MASM Forum