News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

SIMD_BinaryScan (Instring equivalent)

Started by guga, March 17, 2025, 05:27:34 PM

Previous topic - Next topic

guga

Hi Guys

i updated the algorithm on the 1st post. Also inserted comments and the correspondent masm version to be tested (just replace the xmmword with oword - or equivalent for SSE2 in masm).
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

As reported on the 1st post Link, if anyone else wants to give a try on this algo, or improve it for SSE2 obeying the limits of a xmm register (if possible), here are a couple of things that may be done:


1 -
Create a flag before the loop for the cases where the pattern is only 2 bytes. True = Match Pairs ok. False = Not yet, the pattern is bigger than 2 bytes. This flag can be accessed right after the .Test_If_Not_Zero edx, something like.:

(...)
.Test_If_Not_Zero edx
                Do
                    ; Find the first set bit in the bitmask.
                    GET_FIRST_BIT_SET edx | mov D@FirstBitPos eax
                   cmp D@PatternSize 2 |  jz L2>>  So, since the pattern is only 2 bytes and we already found them from the  SIMD computation, there´s no need to go further onto the call to the helper functions.

2 - make usage of xmm4 and xmm5 to store the 2nd and penultimate byte from the pattern and extend them to the correspondent SSE registers. Then on the SImD comparison , use registers xmm6 and xmm7 to store them and also be compared as we did for xmm0 and xmm1. On this way, we will compare at once, 4 bytes from the pattern instead of 2.  And then simply add another pand between the resultant pairs and create the mask on edx as we did in pmovmskb edx XMM1. The problem is that, if you use the 2nd and penultimate byte, you also needs to adjust the pointer to esi to they points to the proper places, something like:

            movq XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
            movq XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
            movq XMM6 X$esi+1 ; Load the 2nd 16 bytes of the current block into xmm0. Var_BlockFirst.
            movq XMM7 X$esi+ebx-2 ; Load the penultimate 16 bytes of the current block into xmm1. Var_BlockLast.

The main problem will relies in obeying the boundaries of the registers to avoid it tries to point to addresses outside the limits of the data. So, you most likely will need to use another x86 register to adjust those pointers dynamically. Something like:

            movq XMM0 X$esi ; Load the first 16 bytes of the current block into xmm0. Var_BlockFirst.
            movq XMM1 X$esi+ebx-1 ; Load the last 16 bytes of the current block into xmm1. Var_BlockLast.
            mov eax esi | add eax D@SecondByteFlag
            movq XMM6 X$eax ; Load the 2nd 16 bytes of the current block into xmm6. Var_BlocSecond.
            mov eax esi | sub eax D@PenultimateByteFlag
            movq XMM7 X$eax+ebx ; Load the penultimate 16 bytes of the current block into xmm7. Var_BlockPenultimate.

Here on this example, those flags represents the amount of bytes to advance or subtract according to the size of the pattern. This must be done before the loop as well. Something like:

  • If pattern = 2 bytes, D@SecondByteFlag = 0, D@PenultimateByteFlag = 1 ; always starts with 1 subtraction
  • If pattern = 3 bytes, D@SecondByteFlag = 1, D@PenultimateByteFlag = 1 ; Pattern is odd, so we need only to point the last byte to itself (-1 as above)
  • If pattern >= 4 bytes, D@SecondByteFlag = 1, D@PenultimateByteFlag = 2 ; Pattern is bigger or equal to4 bytes (odd or even, it won´t matter), so we need only to compare the 2 pairs of the start of the pattern and 2 pairs at the end, regardless the size of the pattern.

In theory, this could work and force the algorithm to advance 16 bytes more often making it be a bit faster, since the chances of 4 matching pairs at those same positions are low in real life applications, in most of the cases (Specially if the data set is composed of random bytes).
The only problem i see in adding those 2 more techniques is if that will really increase more the speed or we just added overheads to the algo, making it slower.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

zedd151

Results from running executable in "BinScan24Mar2025.zip"  attached as a .txt file (not zipped)
¯\_(ツ)_/¯   :azn:

'As we don't do "requests", show us your code first.'  -  hutch—

guga

Tks a lot, Zedd :thumbsup:  :thumbsup:  :thumbsup:

It seems to be working as expected now. Faster and handling all the data. I´m currently working on the reversed function now. I thought it was easier to develop the exact mirror of SIMD_BinaryScan, since i succeeded to fix the original version, but it´s a bit hard to make it stays in the limits when working backwards. I´m trying to see why this is happening, and when succeed, i´ll post here also the backwards function.

Those functions to properly scan data (forward or backward) are the ones i need to fix soon because i´m trying to use them not only inside RosAsm, but making them in general usage as well. Im currently trying to fix several old functions in RosAsm to release a newer version (If i succeed to make the functions works more independently from each other and create the proper dlls for that). I need to fix RosAsm and make it easier to handle the disassembler, debugger, decoder and encoder and also create the proper internal tables to handle the necessary data. On this way, it could be easier to someday - eventually - i decide to create a 64bit version for RosAsm as well. Personally, i´m not a big fan of 64 bits apps, but they are in fact needed nowadays, specially when developing plugins or small apps to work with image, audio and video processing. But, making a 64 bit version of RosAsm is impossible to port from the current Source code - That´s why i´m trying to fix it.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com