PCRE lib anyone have examples of use?

Started by fearless, December 12, 2015, 07:08:42 AM


fearless

Hi,

I was looking for regex stuff lately and came across this library and include file from the old MASM forum: http://www.masmforum.com/board/index.php?topic=15846.0

I had a look at the example, but what I'm looking for is an example of matching files. I have a filespec that I think compiles correctly, and pcre_exec returns 1. What I want is to match all of the files and have it return > 1, or to have some way to loop, calling pcre_exec and then extracting the substring that matches. So far I've not had any success. I've used a lot of the same variables as in the example posted in the topic above.


.DATA
szFileSpec              DB "^.+\.itm$",0 ; regex filespec to match for
szSubject               DB 'foo.itm foo.bar.itmbaz itm.txt plop.itm test.itm boo.itm',0,0 ; test string; I've tried it with a single null terminator as well, so I'm not sure which is the better option here
ovector                 dd 3*(17+1)  dup (0) ; 17 = number of subpatterns (...)
.DATA?
pszError                PBYTE ?
iszError                DD ?
hPCRE                   DD ?
hMATCH                  DD ?
LenSubject              DD ?
lpMatchString           DD ?


To compile it I use this:
Invoke pcre_compile, Addr szFileSpec, PCRE_DUPNAMES, Addr pszError, Addr iszError,0
mov hPCRE, eax ; store compiled result in this variable for later use in match with pcre_exec


and for the match I'm using:

        Invoke pcre_exec, hPCRE, 0, Addr szSubject, LenSubject, 0, 0, Addr ovector, 3*(17+1)
        mov hMATCH, eax
        PrintDec hMATCH ; shows as 1
        .IF hMATCH == PCRE_ERROR_NOMATCH
            PrintText 'PCRE_ERROR_NOMATCH'
        .ELSE
            PrintText 'PCRE_MATCH'
            ; I tried pcre_get_substring and other functions defined in the inc to get the substring, without success. The only call that appears to work is:
            Invoke pcre_get_substring, Addr szSubject, Addr ovector, hMATCH, 0, Addr lpMatchString ; lpMatchString basically contains the full szSubject string
            ; any string number > 0 (the argument after hMATCH) doesn't return a pointer to any substring
            ; I went back and adjusted szSubject to separate the filenames with ',0,' to see if that would help, but I only get an error in hMATCH, -2 = PCRE_ERROR_NULL
            ; I thought of looping and calling pcre_exec and pcre_get_substring each time, but I can't find a way of returning the number of matched results into a variable for the loop count.
        .ENDIF


I'm wondering if anyone has experience using this library and can point me in the right direction: how do I match all files against the compiled regex pattern, and how do I get the list of matched files out into a buffer so I can use it elsewhere? A working example, perhaps? Thanks in advance.


qWord

Where do the file names come from?
Finding several matches in one subject is not that simple (in general): in the source repository you can find the example pcredemo.c that shows how to proceed. Your case can be simplified, because your pattern can't match empty strings, so you just have to update the start offset according to ovector[1].
If the file names come from FindFirstFile/FindNextFile, things get much simpler, because you can match each file separately...
MREAL macros - when you need floating point arithmetic while assembling!

fearless

Hi qWord,

The filenames are from a script file that I'm parsing; the filespec for PCRE is also in these scripts. I can collect the filenames together and put them into an array, or into a space-separated string, or a null-terminated one, or anything else. For testing I'd just put in a few ('foo.itm foo.bar.itmbaz itm.txt plop.itm test.itm boo.itm') that I was hoping to match in one long string. The test pattern is similar to *.itm, but as I can't be sure they will all be so easy, I might have to use PCRE to match them. Most of the examples I came across are equivalent to *.xxx, but I can't really make that assumption on the off chance that one or two aren't like that.

From some more reading I think I might need pcre_dfa_exec, which says it will match all. I'll have a look at the source demo to see if I can figure out for sure what I need to do. Cheers.

If anyone has any working examples, of course that would be very helpful as well.

qWord

I guess it is a low-level API, so there is no chance of getting all matches in one call.
For your application, I would stick with an array of file names and then match each array member separately.
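
A minimal sketch of the per-name approach described above. It is written in Python, with the built-in `re` module standing in for PCRE (an assumption made so the idea can be run as-is; it is not the MASM/PCRE code from this thread). The `matching_names` helper and the sample list are hypothetical.

```python
import re

def matching_names(names, filespec):
    """Return the names that match filespec, testing each name on its own."""
    pattern = re.compile(filespec)      # compile once, reuse for every name
    return [name for name in names if pattern.search(name)]

names = ["foo.itm", "foo.bar.itmbaz", "itm.txt", "plop.itm", "test.itm", "boo.itm"]
print(matching_names(names, r"^.+\.itm$"))
```

Because every name is its own subject, `^` and `$` anchor to that single name and no start-offset bookkeeping is needed.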

Otherwise, it gets more complicated:
include \masm32\include\masm32rt.inc

include PCRE81S.inc
includelib PCRE81S.lib

.code
main proc
LOCAL pszError:PBYTE
LOCAL error_pos:DWORD
LOCAL pszPattern: ptr CHAR
LOCAL nCapturingSubpatterns:DWORD
LOCAL nCapturedSubstrings:SDWORD
LOCAL pOffsets:ptr SDWORD
LOCAL nOffsets:DWORD
LOCAL compiled_pattern:PVOID
LOCAL ccText:DWORD
LOCAL pText:DWORD
LOCAL stdout:HANDLE
LOCAL nWritten:DWORD
   
   
    fnx stdout = GetStdHandle, STD_OUTPUT_HANDLE   
   
    mov pText, chr$("foo.itm foo.bar.itmbaz itm.txt plop.itm test.itm boo.itm")
    mov ccText,len(pText)
   
    mov pszPattern,chr$("\b[^\s]+\.itm\b")
   
    ; compile pattern
    fnx compiled_pattern = pcre_compile, pszPattern, 0, &pszError, &error_pos, 0
    .if !compiled_pattern
        print "ERROR: "
        print pszError,13,10
        print "near position: "
        add error_pos,1
        print str$(error_pos),13,10
        inkey
        exit
    .endif

    ; get number of capturing subpatterns from compiled pattern
    fn pcre_fullinfo, compiled_pattern, 0, PCRE_INFO_CAPTURECOUNT, &nCapturingSubpatterns
    .if eax != 0
        print "error: fullinfo",13,10
        invoke pcre_free, compiled_pattern
        inkey
        exit
    .endif

    ; nOffsets = 3 * (nCapturingSubpatterns + 1)
    mov eax,nCapturingSubpatterns
    add eax,1
    imul eax, 3
    mov nOffsets,eax
   
    ; allocate array for offsets
    imul ecx,eax,SIZEOF SDWORD
    mov pOffsets,alloc(ecx)
    .if !eax
        print "Out of memory.",13,10
        invoke pcre_free, compiled_pattern
        inkey
        exit
    .endif

    xor ebx,ebx
    .while ebx < ccText
       
         fnx nCapturedSubstrings = pcre_exec, compiled_pattern, 0, pText, ccText, ebx, 0, pOffsets, nOffsets
        .if nCapturedSubstrings > 0
           
            print chr$('"')
           
            ; get range of match
            mov eax,pOffsets
            mov edx,[eax][0*SDWORD]     ; range begin
            mov ecx,[eax][1*SDWORD]     ; range end
           
            ; get range size
            sub ecx,edx
           
            ; print match
            add edx, pText
            fn WriteFile, stdout, edx, ecx, &nWritten, NULL
           
            print chr$('"'),13,10
           
        .elseif nCapturedSubstrings == PCRE_ERROR_NOMATCH
            ; no match
            .break
        .else
            ; any error
            print "error ...",13,10
            .break
        .endif
       
        ; continue search after last match
        mov eax,pOffsets
        mov ebx,[eax][1*SDWORD]
    .endw

    invoke pcre_free, compiled_pattern
    free pOffsets
   
    inkey
    exit
   
main endp
end main
(important: the pattern does not allow empty matches)
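
The loop above boils down to "search, record the match, restart at the end offset of that match". A rough sketch of the same idea in Python follows, with `re` standing in for PCRE (an assumption; `pattern.search(text, start)` plays the role of `pcre_exec` with a start offset, and `m.end()` the role of `ovector[1]`). The `find_all` helper is hypothetical.

```python
import re

def find_all(pattern, text):
    """Collect every match by restarting the search at the end of the previous one."""
    matches = []
    start = 0
    while start < len(text):
        m = pattern.search(text, start)
        if m is None:                   # no further match: done
            break
        if m.end() == m.start():        # an empty match would never advance the offset;
            break                       # this sketch simply stops instead of handling it
        matches.append(m.group(0))
        start = m.end()                 # continue right after the last match
    return matches

text = "foo.itm foo.bar.itmbaz itm.txt plop.itm test.itm boo.itm"
print(find_all(re.compile(r"\b[^\s]+\.itm\b"), text))
```

The empty-match guard is the reason for the remark about the pattern: a pattern that can match the empty string needs extra handling that this sketch leaves out.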

fearless

Thanks for that example, qWord, it helped a lot. I changed the test strings to separate them with a CR,LF pair and used PCRE_MULTILINE as an option, and it picks up the test ones that match the "^.+\.itm$" pattern. So when building the string/array in memory I can separate the filenames with CR,LF, and it should match them up based on that.

I appreciate your time and help, cheers.


Edit: just to clarify, in case someone comes across this and wants to know exactly what I meant:

szFileSpec DB "^.+\.itm$",0
szSubject DB 'foo.itm',13,10,' foo.bar.itmbaz',13,10,'itm.txt',13,10,'plop.itm',13,10,'test.itm',13,10,'boo.itm',13,10,0,0


    lea eax, szSubject
    mov pText, eax
    mov ccText,len(pText)
    lea eax, szFileSpec
    mov pszPattern, eax
    mov options, (PCRE_CASELESS + PCRE_MULTILINE + PCRE_NO_UTF8_CHECK)
    ; compile pattern
    fnx compiled_pattern = pcre_compile, pszPattern, options, &pszError, &error_pos, 0


test result output:
"foo.itm"
"plop.itm"
"test.itm"
"boo.itm"


This matches the supplied pattern (similar to *.itm) and doesn't match the other two items, foo.bar.itmbaz and itm.txt, which is correct.
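
For comparison, a small sketch of the same line-per-name matching in Python, with `re` standing in for PCRE (an assumption). One difference to keep in mind: Python's `$` only recognises "\n" as a line end, so the sample below separates names with LF rather than CR,LF. Which newline sequences PCRE recognises depends on how the library was built or on its newline options, which this thread does not state.

```python
import re

# One file name per line; the second entry keeps its leading space as in the test string above.
subject = "foo.itm\n foo.bar.itmbaz\nitm.txt\nplop.itm\ntest.itm\nboo.itm\n"

# MULTILINE makes ^ and $ anchor at every line, IGNORECASE mirrors PCRE_CASELESS.
pattern = re.compile(r"^.+\.itm$", re.MULTILINE | re.IGNORECASE)
matches = pattern.findall(subject)
print(matches)
```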

Grincheux

Where could I find examples of regex patterns?
For example, for links in HTML pages (#<a href=(.*?)>#), images... that are not case sensitive.
For images I suppose: #<a src=(.*jp?g)>#

fearless

The only place I have seen regex used for images is in the .htaccess file, using Apache mod_rewrite to prevent hotlinking or leeching of images:

http://www.cyberciti.biz/faq/apache-mod_rewrite-hot-linking-images-leeching-howto/
http://www.itechlounge.net/2012/01/web-prevent-image-hot-linking-to-your-site/

I've not seen it used directly in HTML files, unless it's with PHP scripting, and PHP has its own match function for regex.

Grincheux

I have copied the code into this procedure:

The pattern is:


szImagePattern            Byte   "/(http:|data).+?\.(?:gif|png|jpe?g)[^",'"',"]*/ig",0

ProcessWebPage            PROC   USES EBX EDI ESI,__lpBuffer:LPBYTE,__dwFileSize:DWord
                     LOCAL   _lpszError:LPBYTE
                     LOCAL   _Error_Pos:DWord
                     LOCAL   _nCapturingSubpatterns:DWord
                     LOCAL   _nCapturedSubstrings:SDWORD
                     LOCAL   _lpOffsets:PTR SDWORD
                     LOCAL   _nOffsets:DWord
                     LOCAL   _lpCompiled_Pattern:LPVOID
                     LOCAL   _ccText:DWord
                     LOCAL   _nWritten:DWord
DEBUG
                     INVOKE   lstrlen,__lpBuffer

                     mov      _ccText,eax

                     INVOKE   pcre_compile,ADDR szImagePattern,PCRE_CASELESS+PCRE_MULTILINE+PCRE_UTF8,ADDR _lpszError,ADDR _Error_Pos,NULL

                     test   eax,eax
                     jz      @Finished

                     mov      _lpCompiled_Pattern,eax

                     ; get number of capturing subpatterns from compiled pattern
                     INVOKE   pcre_fullinfo,_lpCompiled_Pattern,NULL,PCRE_INFO_CAPTURECOUNT,ADDR _nCapturingSubpatterns

                     test   eax,eax              ; pcre_fullinfo returns 0 on success
                     jne      @NotFound

                     ; _nOffsets = 3 * (_nCapturingSubpatterns + 1)
                     mov      eax,_nCapturingSubpatterns
                     add      eax,1
                     imul   eax,3
                     mov      _nOffsets,eax

                     ; allocate array for offsets
                     imul   ecx,eax,SIZEOF SDWORD

                     INVOKE   Mem_Alloc,ecx

                     mov      _lpOffsets,eax
                     .IF !eax
                        INVOKE   pcre_free,_lpCompiled_Pattern

                        xor      eax,eax
                        ret
                     .ENDIF

                     xor ebx,ebx

                     .WHILE ebx < _ccText
                          INVOKE   pcre_exec,_lpCompiled_Pattern,NULL,__lpBuffer,_ccText,ebx,0,_lpOffsets,_nOffsets
                          mov      _nCapturedSubstrings,eax
                        .IF _nCapturedSubstrings > 0
                           ; get range of match
                           mov      eax,_lpOffsets
                           mov      edx,[eax][0*SDWORD]     ; range begin
                           mov      ecx,[eax][1*SDWORD]     ; range end

                           ; get range size
                           sub      ecx,edx

                           ; print match
                           add      edx,__lpBuffer
                           INVOKE   SearchString,edx,ecx
                        .ELSE
                           .BREAK
                        .ENDIF
                        ; continue search after last match
                        mov      eax,_lpOffsets
                        mov      ebx,[eax][1*SDWORD]
                     .ENDW

                     INVOKE   Mem_Free,_lpOffsets
                     INVOKE   pcre_free, _lpCompiled_Pattern

                     mov      eax,TRUE
                     ret

@NotFound :

                     INVOKE   pcre_free,_lpCompiled_Pattern

                     xor      eax,eax
                     ret

;   **********************************************************************************
                     ALIGN   4
;   **********************************************************************************

@Finished :

                     ret
ProcessWebPage            ENDP


I want to locate an image in an HTML page.
Before this function is called, the web page is downloaded into __lpBuffer.

I use PCRE81S.lib, found in the old forum.
The function does not find any image, but I know there is at least ONE.

Quote
Extract from the web page: http://www.larevueautomobile.com/HD-Sexy-modele_Girls-and-Cars-vue_Exterieur-img_Sexy_Girls_and_Cars_028.jpg-image?utm_source=ImageLRA&utm_medium=HD&utm_term=Image-HD

<li><a href="http://moto.larevueautomobile.com" title="Moto"><img src="http://www.larevueautomobile.com/moto.png" class="menulien menulien2" alt="Moto icone"  /><b>MOTO</b></a></li>

I need your help.
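
One thing that may be worth checking in the pattern above: the leading `/` and trailing `/ig` are Perl/JavaScript-style notation. `pcre_compile` takes the bare pattern, so those characters are matched literally, and case-insensitivity is requested through the options argument (PCRE_CASELESS) instead. The sketch below illustrates the effect in Python, with `re` standing in for PCRE (an assumption); the `html` sample is a shortened version of the extract quoted above.

```python
import re

html = ('<img src="http://www.larevueautomobile.com/moto.png" '
        'class="menulien menulien2" alt="Moto icone" />')

# Delimiters and flags left inside the pattern text: they become literal characters.
delimited = re.compile(r'/(http:|data).+?\.(?:gif|png|jpe?g)[^"]*/ig')

# Bare pattern, with case-insensitivity passed as a flag (compare PCRE_CASELESS).
bare = re.compile(r'(http:|data).+?\.(?:gif|png|jpe?g)[^"]*', re.IGNORECASE)

print(delimited.search(html))       # nothing is found: the text has no "/http:" ... "/ig"
print(bare.search(html).group(0))
```

Finding every image, which the `g` flag would ask for in those languages, is already covered by the start-offset loop in the procedure.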

jj2007

There are many images in that page:

include \masm32\MasmBasic\MasmBasic.inc      ; download
  Init
  Let esi=FileRead$("http://www.larevueautomobile.com/HD-Sexy-modele_Girls-and-Cars-vue_Exterieur-img_Sexy_Girls_and_Cars_028.jpg-image?utm_source=ImageLRA&utm_medium=HD&utm_term=Image-HD")
  xor ebx, ebx
  .While 1
      inc ebx
      .Break .if !Extract$(esi, 'img src="', '"', xsLoop, 100)
      PrintLine eax
  .Endw
EndOfCode


Output:
http://www3.smartadserver.com/call/pubi/63121/490986/26008/S/[timestamp]/?
http://www3.smartadserver.com/call/pubi/63121/490986/26006/S/[timestamp]/?
http://www.larevueautomobile.com/logo/Nouveau-Logo-2014.jpg
http://www.larevueautomobile.com/accueil-LaRevueAutomobile.png
http://www.larevueautomobile.com/news.png
http://www.larevueautomobile.com/photo.png
http://www.larevueautomobile.com/fiche-technique.png
http://www.larevueautomobile.com/sport-auto.png
http://www.larevueautomobile.com/moto.png
http://www.larevueautomobile.com/lifestyle.png
/icone-2014/Magazine-Auto-Menu.png
http://www3.smartadserver.com/call/pubi/63121/490986/25988/S/[timestamp]/?
http://www.larevueautomobile.com/css-voiture/Plus-Photo.png
/images/Sexy/Girls-and-Cars/Sexy_Girls_and_Cars_028.jpg
http://www3.smartadserver.com/call/pubi/63121/490996/26762/M/[timestamp]/?
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy_Girls_a/Sexy_Girls_and_Cars_001.jpg
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy_Hot_ Ba/Sexy_Girls_and_Cars_029.jpg
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy_Hot_ Ba/Sexy_Girls_and_Cars_031.jpg
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy_Hot_ Ba/Sexy_Girls_and_Cars_007.jpg
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy/Sexy_Hot_ Babes_Cars_197.jpg
http://www.larevueautomobile.com/css-voiture/Plus-Photo.png
/photo-voiture/photo-voiture.php?src=/images/Peugeot/208-Puretech110-Eat6/Exterieur/Peugeot_208_Puretech110_Eat6_010.jpg
/photo-voiture/photo-voiture.php?src=/images/Peugeot/Quartz/Exterieur/Peugeot_Quartz_005.jpg
/photo-voiture/photo-voiture.php?src=/images/Ford/Fiesta-Titanium-EcoBoost-2015/Exterieur/Ford_Fiesta_Titanium_EcoBoost_2015_012.jpg
/photo-voiture/photo-voiture.php?src=/images/Mitsubishi/Pajero-Long-Di-D-Instyle/Exterieur/Mitsubishi_Pajero_Long_Di_D_Instyle_028.jpg

/photo-voiture/photo-voiture.php?src=/images/Bmw/M5-E34-Evo/Exterieur/Bmw_M5_E34_Evo_008.jpg
/photo-voiture/photo-voiture.php?src=/images/KTM/1290-Super-Duke-R/Exterieur/KTM_1290_Super_Duke_R_003.jpg
http://www3.smartadserver.com/call/pubi/63121/490993/26762/M/[timestamp]/?
http://www3.smartadserver.com/call/pubi/63121/490986/25996/S/[timestamp]/?

Grincheux

What would I do if you were not here?
You are GOD
I will download MasmBasic asap.


Thanks a lot JJ

Grincheux

To read a file from the internet I do:


LoadWebPage PROC __hWnd:HWND,__lpszPageWeb:LPSTR
LOCAL _szWebFile[MAX_PATH]:Byte
LOCAL _dwSize:DWord
LOCAL _hUrlFile:HANDLE
LOCAL _hInternet:HANDLE
LOCAL _lpBuffer:LPBYTE
LOCAL _dwBytesRead:DWord

INVOKE VirtualAlloc,NULL,1024 * 1024 * 10,MEM_COMMIT+MEM_RESERVE,PAGE_EXECUTE_READWRITE

test eax,eax
jz @Error_00

mov _lpBuffer,eax

INVOKE InternetOpen,NULL,PRE_CONFIG_INTERNET_ACCESS,NULL,INTERNET_INVALID_PORT_NUMBER,0

test eax,eax
jz @Error_01

mov _hInternet,eax

INVOKE InternetOpenUrl,_hInternet,__lpszPageWeb,NULL,0,INTERNET_FLAG_RELOAD,0

test eax,eax
jz @Error_02

mov _hUrlFile,eax

INVOKE InternetReadFile,_hUrlFile,_lpBuffer,1024 * 1024 * 10,ADDR _dwBytesRead

test eax,eax
jz @Error_03

INVOKE InternetCloseHandle,_hUrlFile
INVOKE InternetCloseHandle,_hInternet

INVOKE AnalyseFile,__hWnd,_lpBuffer
INVOKE VirtualFree,_lpBuffer,0,MEM_RELEASE

mov eax,TRUE
ret

@Error_03 :

INVOKE InternetCloseHandle,_hUrlFile

@Error_02 :

INVOKE InternetCloseHandle,_hInternet

@Error_01 :

INVOKE VirtualFree,_lpBuffer,0,MEM_RELEASE

@Error_00 :

xor eax,eax

@Exit :

ret
LoadWebPage ENDP


Is there another way to do this? I thought that a GET would be better.
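
A point that may matter for the procedure above: a single InternetReadFile call can return fewer bytes than requested, and the WinINet documentation says to keep calling it until it succeeds with zero bytes read. The sketch below shows that read-until-empty loop in Python. It uses an in-memory `io.BytesIO` as a stand-in for the network handle (an assumption, so that it runs offline); `read_all` and `page` are hypothetical names.

```python
import io

def read_all(stream, chunk_size=4096):
    """Read a stream to the end in chunks, stopping when a read returns no data."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)     # may return less than chunk_size
        if not chunk:                       # zero bytes read: end of data
            break
        chunks.append(chunk)
    return b"".join(chunks)

page = b"<html>" + b"x" * 10000 + b"</html>"
data = read_all(io.BytesIO(page), chunk_size=1024)
print(len(data))
```

In the MASM version the equivalent would be to advance the buffer pointer by the number of bytes read on each pass, while checking that the 10 MB buffer is not overrun.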