News:

Masm32 SDK description, downloads and other helpful links
Message to All Guests
NB: Posting URL's See here: Posted URL Change

Main Menu

SIMD_BinaryScan (Instring equivalent)

Started by guga, March 17, 2025, 05:27:34 PM

Previous topic - Next topic

guga

Hi Larry

Tks a lot

Better would use the disassembled version, until someone port it to the default masm macros set. I gave a try porting it, but i´m not used to masm syntax since years, and had to translate the macros with ChatGPT (In fact, deepseek did the translation).

But, if you are used to masm macros system, it shouldn't be hard to port. RosAsm If macro set works the same as in Masm version the If_Or and If_And macros in RosAsm with maybe similar to If xxx= 5 & yy= 6  or If xxx= 10 OR yyy = 30 (If i remember, masm If macro set allowed something like that, right ? I mean, using multiple If variables on the same line with some boolean &, OR, right ?)

The while and end_while macro in RosAsm, is the same as while endwhile in masm, except it don´t accept multiple conditions.
The do loop_until macro in RosAsm is also similar to while macro in masm, except it don´t have the jmp opcode at the start. so, it works like:

Start:
   mov eax 5
   dec eax
   cmp eax 0 | jne Start

or je, jle etc etc depending teh used RosAsm token, but the functionality is like that.

The uses macro in rosasm is just to preserve registers on a procedure. It´s a simple push reg at the start of a function and correspondant pop reg at the end (i guess masm has also something like that).

In RosAsm to make the source better to be read we can use separators "|" to dislay the code on a same line. THis is unused during compilation time, it´s just a token to make the code more readable, specially if the code have several lines, and we want it do display in one only. So

mov eax 5000 |  shr eax 3 | inc eax

is the same as
mov eax 5000
shr eax 3
inc eax

The usage of ","  in between the opcodes is optional. So you can also write it as:
mov eax, 5000
shr eax, 3
inc eax

Or any equivalent in Masm. - I personally, don´t use the comma´s too often for short opcodes. I use it more in call function with several parameters. But in RosAsm, it is optional, it won´t affect at all the code since it is handled by the assembler to simply skip those chars on certain conditions.

Now, if the function really work as expected, now it´s time to do the backwards version of it. So, a mirror function to works backwards searching the pattern from end to the beginning of the data chunk. I already did that, but since i updated the forward version, i´ll try to see if i can do the same with the backwards
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

zedd151

I'm not very good with 'real' macros, just simple macros.
Using the disassembled version might assemble fine, and actually work, but it would require some modifications regarding all of the 'loc_004xxxxx' labels, etc. and a few other things to make it a proper source code. I don't have a lot of time to do all of that (I have projects of my own to work on), but yes it is possible to do.

Hopefully, someone else here can help you to make a proper Masm32 compatible version of these functions. It looks like a very promising and useful set of functions.   :thumbsup:

Cheers...
Larry
¯\_(ツ)_/¯   :azn:

'As we don't do "requests", show us your code first.'  -  hutch—

ognil

Congratulations Guga for SIMD_BinaryScan (equivalent to Instring), :thumbsup:

It's great that you continue to have fun and feed the forum with interesting 32 bit code that our eminent translator zed151 tests and comments on. :badgrin:   
I'm also grateful for the IDA translation into MasmCompatible.asm (unfortunately without any comments on the individual lines). :undecided: 
For me, this is a perfect example of amateurish attitude towards the Instring function, because:
1. There is no check in the forum or by AI for an existing function written by a professional.
2. Splitting a simple function into three separate functions, etc.  :smiley:

I'll stop here and suggest the following:
1. Take a look at the following elegant function

;*******************************************************************;
; Immediate byte constants
;EQUAL_ANY     = 0000b
;RANGES     = 0100b
;EQUAL_EACH     = 1000b
;EQUAL_ORDERED     = 1100b
;NEGATIVE_POLARITY  = 010000b
;BYTE_MASK     = 1000000b
;*******************************************************************************************************;
align 16
db 8 dup (90h)
InstrOg proc 
mov         rax,  rcx                    ; rcx = haystack, rdx = needle   
movdqu     xmm2, oword ptr[rdx] ; load 1st 16 bytes of needle
pxor        xmm3, xmm3
@Loop: ; find the first possible match of 16-byte fragment in haystack       
pcmpistri   xmm2, oword ptr[rax], EQUAL_ORDERED
lea         rax, [rax+16]
ja          @Loop
lea         rax, [rax+rcx-16] ; save the possible match start
mov         r9, rdx                      ; r9 = rdx rdx = needle
mov         r8, rax
jnc         @NotFound
sub         r9, rax
@@: ; compare the strings
movdqu      xmm1, oword ptr [r8 + r9]
pcmpistrm   xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
movdqu      xmm4, oword ptr[r8]
pand        xmm4, xmm0
pcmpistri   xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
lea         r8, [r8+16]
ja          @b
jnc         @Found
add         rax,1 ; continue searching from the next byte
jmp         @Loop
@NotFound:
xor         eax, eax
@Found:
ret
InStrOg endp
;*****************************************************************;
 

2. If you like it, just request a translation to MASM32 and testing from zed151.
I hope you continue to have fun and feed the forum with interesting code. :thumbsup:
"Not keeping emotions under control is another type of mental distortion."

guga

#18
Tks, but Couldn´t insert the comments on each line due to the limitations of the amount of characters on the forum. But, the comments on each line are displayed ion the RosAsm version, so it wouldn´t be hard to follow.

"For me, this is a perfect example of amateurish attitude towards the Instring function, because:
1. There is no check in the forum or by AI for an existing function written by a professional."

About the code...amatory attitude ? Not at all, there are papers describing the same method I used, scanning the data from 16 bytes to 16 bytes. Using 3 functions was mainly to avoid extrapolating the limits of a SSE register and also working on small datasets and small patterns. This function is not yet to be used as a instring on the normal way, since it does not look for null terminated byte, it is used to scan for a chunk of data (pattern) of any size on a dataset also of any size.

Also, it is for 32 bits and not 64 and so far, I´m limited to test with SSE2 mainly.

Your code, although shorter, is for 64 bits, using SSE4, but is limited to 16-byte aligned strings, needles bigger or equal to 16 bytes, null-terminated strings, cases where haystack is large enough, since it seems to not check the size of the pattern or the dataset itself.
(Unless, the code you posted is the core of another function that already check those limitations)

I would like to test it, but I can't port it to 32 bits and it is needed (for now) to be limited to SSE2.

I´ll try to port it to 32 bits using only SSE2 and if succeeded, I´ll post here the results
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

guga

I gave a test trying to port it to 32 bits, but unsure if it is the proper equivalence. If the code below is the equivalent for your´s 64 bits version, then, it is not working:

Is this the proper porting to 32 bits using only SSE2 ?
Proc InstrOg:
    Arguments @pHaystack, @pNeedle
    Uses esi, edi, ebx

    mov eax D@pHaystack            ; eax = haystack
    mov esi D@pNeedle
    movdqu xmm2 X$esi        ; load 1st 16 bytes of needle into xmm2
    pxor xmm3 xmm3                 ; xmm3 = 0 (used for comparisons)

@MainLoop:
    ; Compare 16 bytes of haystack with the needle
    movdqu xmm1 X$eax              ; load 16 bytes of haystack into xmm1
    pcmpeqb xmm1 xmm2              ; compare bytes for equality
    pmovmskb ecx xmm1              ; create a bitmask of the comparison results
    test ecx ecx                   ; check if any bytes matched
    jnz @PossibleMatch              ; if any match, jump to PossibleMatch

    add eax 16                     ; move to the next 16-byte block in haystack
    jmp @MainLoop                   ; continue searching

@PossibleMatch:
    ; A possible match was found. Now verify the full string.
    mov esi eax                    ; esi = start of the possible match in haystack
    mov edi D@pNeedle              ; edi = needle
    mov ebx 16                     ; ebx = number of bytes to compare

@VerifyLoop:
    movdqu xmm1 X$esi              ; load 16 bytes of haystack into xmm1
    movdqu xmm4 X$edi              ; load 16 bytes of needle into xmm4
    pcmpeqb xmm1 xmm4              ; compare bytes for equality
    pmovmskb ecx xmm1              ; create a bitmask of the comparison results
    cmp ecx 0FFFF                 ; check if all 16 bytes matched
    jne @NoMatch                    ; if not, jump to NoMatch

    add esi 16                     ; move to the next 16-byte block in haystack
    add edi 16                     ; move to the next 16-byte block in needle
    sub ebx 16                     ; decrement the remaining bytes to compare
    jnz @VerifyLoop                 ; if more bytes to compare, continue

    ; Full match found
    mov eax esi                    ; return the start of the match
    jmp @Done

@NoMatch:
    ; No match, continue searching from the next byte
    add eax 1                      ; move to the next byte in haystack
    jmp @MainLoop                   ; continue searching

@Done:
    ; Restore registers and return
    mov eax esi                    ; return the start of the match (or 0 if not found)

EndP



[TestingString2: B$ "578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghdmerdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0]
[CharToFind2: B$ "merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0]

call InstrOg TestingString2, CharToFind2
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

six_L

Hi,guga

His codes works fine.
.data
; Define example data
haystack db "578987hGugf8ahj1g1ghdlgjGughjkhjk8ahj18ahj1g1ghdmerdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0
needle db "merdalgjg1ghdlgj8a8ahj1g1ghdlgjhj1g1ghdlgj", 0
.code
;¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
InStrOg proc plBuf:QWORD,plNed:QWORD

mov         rax,  rcx                    ; rcx = haystack, rdx = needle   
movdqu     xmm2, oword ptr[rdx] ; load 1st 16 bytes of needle
pxor        xmm3, xmm3
@Loop: ; find the first possible match of 16-byte fragment in haystack       
pcmpistri   xmm2, oword ptr[rax], EQUAL_ORDERED
lea         rax, [rax+16]
ja          @Loop
lea         rax, [rax+rcx-16] ; save the possible match start
mov         r9, rdx                      ; r9 = rdx rdx = needle
mov         r8, rax
jnc         @NotFound
sub         r9, rax
@@: ; compare the strings
movdqu      xmm1, oword ptr [r8 + r9]
pcmpistrm   xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
movdqu      xmm4, oword ptr[r8]
pand        xmm4, xmm0
pcmpistri   xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
lea         r8, [r8+16]
ja          @b
jnc         @Found
add         rax,1 ; continue searching from the next byte
jmp         @Loop
@NotFound:
xor         rax, rax
@Found:
ret

InStrOg endp
...
invoke InStrOg, addr haystack, addr needle
mov result, rax

; Check the result
.if result == 0
; If the result is 0, the pattern was not found.
invoke AddLog,CStr("Pattern not found.",13,10)
.else
; If the result is non-zero, the pattern was found.
invoke  RtlZeroMemory,addr szBuf,sizeof szBuf
mov rax,result
lea rcx,haystack
sub rax,rcx
invoke wsprintf,addr szBuf,CStr("Pattern found at position(offset): %d ",13,10),rax
invoke AddLog,addr szBuf
.endif

QuotePattern found at position(offset): 48
Say you, Say me, Say the codes together for ever.

guga

Hi Six_L. Tks for testing

So, it was how i translate it to 32 bits that isn't working. I cannot test 64 bits apps in RosAsm right now. If someone could port this to 32 bits and with SSE2 it could be better for test. But, his version limits the size of strings and patterns or extrapolates the boundaries of a SSE register ? Those are the issues i would like to check.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com

ognil

QuoteIf someone could port this to 32 bits and with SSE2 it could be better for test.

And wouldn't it be better if you changed it to:
 If someone could port my code  to 64 bits  it could be better for test. Аnyway... :thumbsup:

Thanks six_L for your translation, but I don't understand why it includes 2 unnecessary variables that are not used anywhere in the code. I mean plBuf:QWORD and plNed:QWORD. :smiley:
"Not keeping emotions under control is another type of mental distortion."

six_L

Hi,ognil
QuoteI don't understand why it includes 2 unnecessary variables that are not used anywhere in the code. I mean plBuf:QWORD and plNed:QWORD.
I like the macros. Using the macros makes it easier to grasp the correctness of code from a macroscopic angles. if i have not used the "invoke" macros. then the code change the following:
    invoke InStrOg, addr haystack, addr needle
    mov result, rax
==>
   
    mov    rcx,offset haystack
    mov    rdx,offset needle       
    call     InStrOg
    mov     result, rax

Regards
six_L
Say you, Say me, Say the codes together for ever.

ognil

#24
Thanks to six_L for trying to explain the presence of two unused variables in your code.

Unfortunately, this is wrong, because there is nothing different in the result of the two calls you showed, from the debugger's perspective:
invoke InStrOg,addr szText,addr szText31 generate in debugger MS VC:
000000014000113D 48 8D 0D 6C 8D 06 00 lea   rcx,[szText (0140069EB0h)] 
0000000140001144 48 8D 15 74 8D 06 00 lea   rdx,[szText31 (0140069EBFh)] 
000000014000114B E8 3D 82 00 00       call  InStrOg (014000938Dh)

In other words, the macro does not write the variables plBuf:QWORD and plNed:QWORD into the code!!.
This is done by the programmer additionally, without being necessary in this case.
This is one of the reasons why I do not use macros, because then a mandatory check with the debugger must be done.
"Not keeping emotions under control is another type of mental distortion."

TimoVJL

/Fl⟦filename⟧   Generates an assembled code listing. See /Sf.
May the source be with you

_japheth

Quote from: ognil on March 20, 2025, 05:40:18 AM2.   
Quotemov  rcx, offset haystack ;    mov    rdx,  offset needle   
Тhis may or may not work depending on how big the variable address is in memory.

Isn't this a questionable statement? "MOV reg64,imm" is actually one of the few instructions that allow 64-bit immediates as source operand (so "mov rdx,offset xxx" is supposed to work for ANY 64-bit address). OTOH, "LEA reg64, [mem64]" has a displacement operand of 32-bit only, so it's restricted to either PC +/- 31-bit offset or, for non-PC-relative addressing, to address the first 4 GB only.

Dummheit, gepaart mit Dreistigkeit - eine furchtbare Macht.

six_L

Hi,ognil

1,
QuoteIn other words, the macro does not write the variables plBuf:QWORD and plNed:QWORD into the code!!.
This is done by the programmer additionally, without being necessary in this case.
your "InStrOg" PROC have two parameters[in]. i added the variables plBuf:QWORD and plNed:QWORD into the code for receiving the inputted parameters.
"addr haystack" --> "plBuf" --> "rcx"
"addr needle"   --> "plNed" --> "rdx"

Another way(remove the variables plBuf:QWORD and plNed:QWORD):
InStrOg proc
   
    mov         rax,  rcx                       ; rcx = haystack, rdx = needle   
    movdqu        xmm2, oword ptr[rdx]        ; load 1st 16 bytes of needle
    pxor        xmm3, xmm3
@Loop:                            ; find the first possible match of 16-byte fragment in haystack       
    pcmpistri   xmm2, oword ptr[rax], EQUAL_ORDERED
    lea         rax, [rax+16]
    ja          @Loop
    lea         rax, [rax+rcx-16]            ; save the possible match start
    mov         r9, rdx                          ; r9 = rdx rdx = needle
    mov         r8, rax
    jnc         @NotFound
    sub         r9, rax
@@:                            ; compare the strings
    movdqu      xmm1, oword ptr [r8 + r9]
    pcmpistrm   xmm3, xmm1, EQUAL_EACH + NEGATIVE_POLARITY + BYTE_MASK ; mask out invalid bytes in the haystack
    movdqu      xmm4, oword ptr[r8]
    pand        xmm4, xmm0
    pcmpistri   xmm1, xmm4, EQUAL_EACH + NEGATIVE_POLARITY
    lea         r8, [r8+16]
    ja          @b
    jnc         @Found
    add         rax,1                ; continue searching from the next byte
    jmp         @Loop
@NotFound:
    xor         rax, rax
@Found:
    ret

InStrOg endp   
...
    ; Call the SIMD_BinaryScan function
    lea    rcx,haystack
    lea    rdx,needle
    invoke    InStrOg
    mov    result, rax

    ; Check the result
    .if result == 0
        ; If the result is 0, the pattern was not found.
        invoke    AddLog,CStr("Pattern not found.",13,10)
    .else
        ; If the result is non-zero, the pattern was found.
        invoke  RtlZeroMemory,addr szBuf,sizeof szBuf
        mov    rax,result
        lea    rcx,haystack
        sub    rax,rcx
        invoke    wsprintf,addr szBuf,CStr("Pattern found at position(offset): %d ",13,10),rax
        invoke    AddLog,addr szBuf
    .endif

2,
in UASM64
"lea    rcx,haystack" == "mov rcx, offset haystack"

Regards
six_L
Say you, Say me, Say the codes together for ever.

ognil

Thanks guys, :smiley:

I don't want to argue with you because this thread is basically cleared of comments and we should respect the work of others. When you post your own competing code against Guga's code, I would be happy to comment on it in detail. :thumbsup:
"Not keeping emotions under control is another type of mental distortion."

guga

#29
New version coming soon. Succeeded to make no errors so far, and kept the speed for both (match and non match cases).  Also succeeded (finally) to force the data to always stays within the limits of a SSE registers. On this way we have 0 chance to an error when trying to load data beyond the limits.. Ex:
[DataSet: B$ "Hello world, i´m trying to find a string", 0] ; last byte is "g" and not 0
[Pattern: B$ "string", 0] ; 1st and last chars are s and g, respectively

On this, way, if the loop happens to be in "ind a string" - < pointer in esi is here. Then, when it try to load in xmm0, since this chunk has only 12 bytes long, it cannot load the extra 13 to 16 bytes. (We are not loading only strings, and we have no guarantees that we still have enough memory after the ending "g" to be loaded, since the data can be located at the very end of a block of memory or physical data (ex: we are trying to locate a chunk of bytes on the end of the .data section of a PE or other type of file and the memory ends immediately after the "g")

I succeeded to get no overheads on the algorithm. Now, i´m trying to improve it a little bit, tweaking how it can handle now chunks of data smaller than 31 bytes long, since those are related both as the remainder bytes of a long chain of data and also as small datasets to be scanned.

So, far, in both i´ve got the worst speed of around 90-96 clock cycles on the 167 types of data used to scan and around 40-43 clock cycles the fastest on my AMD.


Slowest Match cases

Clock cycles: 96 clocks
Example Case: 35

Fastest Match cases

Clock cycles: 43 clocks
Example Case: 21
_______________________________________________

Not Found Min Max Timmings:
_______________________________________________

Slowest Match cases

Clock cycles: 73 clocks
Example Case: 19

Fastest Match cases

Clock cycles: 32 clocks
Example Case: 83


Press enter to exit...


I´ll see if the approach i´m planning to handle those data smaller than 32 bytes will have any impact on the performance of the algorithm. I´ll try creating a hybrid algorithm that can handle only this small blocks of data using SSE2 registers and normal x86 registers (for even smaller chunks) to see if it is worthful than using the currently function as (Which is a bit slow, btw):
Proc CompareSmallDataBlock:
    Arguments @pDataBuffer, @DataSize, @pSearchPattern, @PatternSize
    Local @EndString
    Uses esi, edi, edx, ecx

    ; Initialize pointers.
    mov esi D@pDataBuffer
    mov eax D@DataSize
    lea eax D$esi+eax | sub eax D@PatternSize | mov D@EndString eax

    mov edi D@pSearchPattern
    .Do
        ; Compare the pattern byte-by-byte.
        mov ecx D@PatternSize
        xor eax eax
        Do
            movzx edx B$esi+eax | cmp dl B$edi+eax | jne L1>
            inc eax
            dec ecx
        Repeat_Until_Zero
        ; If the pattern matches, return the starting position.
        mov eax esi
        ExitP

    L1:
        ; Move to the next position in the dataset.
        inc esi
    .Loop_Until esi > D@EndString

    ; If no match is found, return 0.
    xor eax eax

EndP



Note: I tested the suggested routines described in several papers (Link1 , link2 etc etc), but all of them failed in terms of performance, speed and reliability, since they all extrapolated the boundaries of SSE2 registers. Also in some of those papers they suggested that we could ignore the comparison of the 1st and last char before it pass to the core compare function, but this is useless. No actual gain of speed whatsoever. Forcing it to recognize the size of the dataset to just advance 1 bytes and reduce the size in 2, so far caused no impact. IN fact, it leaded the algo to be a bit slower in 3 or 4 clock cycles.
The technique described in those papers are similar to the one i created, except to the fact that my algo respects the limits for a SSE2 register and also is automatically adjustable to any size of data and any size of pattern. The problem with many of those algorithms i tested so far (the one in github as well), is that when it enter on the loop to compare the possible matches, it does not calculate only the matching pairs. Instead, it perform more loops than necessary, and also don´t check the true limit of the dataset , which is not only the cause of the limit extrapolation but also make the functions to be slower because it continues to loop until it reaches the true end of the data, rather than estimate the maximum limit to be performed.  For, instance, if the pattern is 10 bytes long, and after all the loops, the ending pointer still needs to check for some remainders. The algorithm must terminate if the rest of the data to be scanned is smaller than the pattern. I.e: say we reach the limit for the remainders, and there´s still 5 bytes to the real end of the dataset. If the pointer is located somewhere between the 10th byte of true end of the data and the limit, then there´s no room to scan, because we cannot have any pattern beyond that. Ex: esi ends in 10 but the true data ending is located in 17. The pattern size = 10 bytes, but, if we are at the end of the limit, the amount of bytes to the end is only 7 bytes, so smaller then the patternsize itself.

Attached the new benchmark version to test.
Coding in Assembly requires a mix of:
80% of brain, passion, intuition, creativity
10% of programming skills
10% of alcoholic levels in your blood.

My Code Sites:
http://rosasm.freeforums.org
http://winasm.tripod.com