Hi,
I was looking into regex lately and came across this library and include file on the old MASM forum: http://www.masmforum.com/board/index.php?topic=15846.0
I had a look at the example, but what I'm looking for is an example of matching file names. I have a filespec that I think compiles correctly, and pcre_exec returns 1. What I want is to match all the names and have it return > 1, or have some way to loop, calling pcre_exec and then extracting the substring that matches. So far I've had no success. I've used a lot of the same variables as in the example posted in the topic above.
.DATA
szFileSpec DB "^.+\.itm$",0 ; regex filespec to match for
szSubject DB 'foo.itm foo.bar.itmbaz itm.txt plop.itm test.itm boo.itm',0,0 ; test string; I've tried it with a single null terminator as well, so not sure which is the best option here
ovector dd 3*(17+1) dup (0) ; 17 = number of subpatterns (...)
.DATA?
pszError PBYTE ?
iszError DD ?
hPCRE DD ?
hMATCH DD ?
LenSubject DD ?
lpMatchString DD ?
To compile it I use this:
Invoke pcre_compile, Addr szFileSpec, PCRE_DUPNAMES, Addr pszError, Addr iszError,0
mov hPCRE, eax ; store compiled result in this variable for later use in match with pcre_exec
and for the match I'm using:
Invoke pcre_exec, hPCRE, 0, Addr szSubject, LenSubject, 0, 0, Addr ovector, 3*(17+1)
mov hMATCH, eax
PrintDec hMATCH ; shows as 1
.IF hMATCH == PCRE_ERROR_NOMATCH
PrintText 'PCRE_ERROR_NOMATCH'
.ELSE
PrintText 'PCRE_MATCH'
; tried pcre_get_substring and the other functions defined in the inc to get the substring, without success; the only call that appears to work is:
Invoke pcre_get_substring, Addr szSubject, Addr ovector, hMATCH, 0, Addr lpMatchString ; lpMatchString basically contains the full szSubject string
; any other value > 0 after the hMATCH argument doesn't give a pointer to any substring
; I went back and adjusted szSubject to separate the filenames with ',0,' to see if that would help, but only get an error in hMATCH: -2 = PCRE_ERROR_NULL
; thought of looping and calling pcre_exec and pcre_get_substring each time, but can't find a way of getting the number of matches into a variable for the loop count.
.ENDIF
I'm wondering if anyone has experience using this library and can point me in the right direction: how to match all the file names that match the compiled regex pattern, and how to get the list of matched names out into a buffer or something so I can use it elsewhere. Or perhaps a working example? Thanks in advance.
Where do the file names come from?
Finding several matches in one subject is not that simple in general: in the source distribution (http://sourceforge.net/projects/pcre/files/pcre/8.37/) you can find the example pcredemo.c, which shows how to proceed. In your case it can be simplified, because your pattern can't match empty strings, so you just have to update the start offset according to ovector[1].
If the file names come from FindFirst/NextFile, things get much simpler, because you can match each file separately...
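To make both points concrete, here is a small sketch in Python, chosen only because its `re` syntax is close to PCRE's; it is not the PCRE API. It shows why the posted call reports a single match (the anchored pattern spans the whole space-separated subject) and what matching each name separately gives. The helper name `matching_names` is made up for the example.

```python
import re

FILESPEC = re.compile(r"^.+\.itm$")
SUBJECT = "foo.itm foo.bar.itmbaz itm.txt plop.itm test.itm boo.itm"

# One anchored match over the whole subject: "." also matches spaces,
# so the single match covers the complete string.
whole = FILESPEC.search(SUBJECT)

def matching_names(names):
    """Return the names that match FILESPEC when tested one at a time."""
    return [name for name in names if FILESPEC.search(name)]

print(whole.group(0) == SUBJECT)
print(matching_names(SUBJECT.split(" ")))
```

Note that a return value of 1 from pcre_exec does not mean "one file matched": it is the number of offset pairs that were filled in, and a pattern without capturing subpatterns yields 1 for any successful match.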
Hi qWord,
The filenames come from a script file that I'm parsing; the filespec for PCRE is also in these scripts. I can collect the filenames together and put them into an array, or into a space-separated or null-terminated string, or anything else. For testing I'd just put in a few ('foo.itm foo.bar.itmbaz itm.txt plop.itm test.itm boo.itm') that I was hoping to match in one long string. The test pattern is similar to *.itm, but as I can't be sure they will all be so easy I might have to use PCRE to match them. Most of the examples I came across are equivalent to *.xxx, but I can't really make that assumption on the off chance that one or two aren't like that.
From some more reading I think I might need pcre_dfa_exec, which is said to match all. I'll have a look at the source demo to see if I can figure out for sure what I need to do. Cheers.
If anyone has any working examples of course that would be very helpful as well.
I guess it is a low-level API, so there is no chance of getting all matches in one call.
For your application, I would stick with an array of file names and then match each array member separately.
Otherwise, it gets more complicated:
include \masm32\include\masm32rt.inc
include PCRE81S.inc
includelib PCRE81S.lib
.code
main proc
LOCAL pszError:PBYTE
LOCAL error_pos:DWORD
LOCAL pszPattern: ptr CHAR
LOCAL nCapturingSubpatterns:DWORD
LOCAL nCapturedSubstrings:SDWORD
LOCAL pOffsets:ptr SDWORD
LOCAL nOffsets:DWORD
LOCAL compiled_pattern:PVOID
LOCAL ccText:DWORD
LOCAL pText:DWORD
LOCAL stdout:HANDLE
LOCAL nWritten:DWORD
fnx stdout = GetStdHandle, STD_OUTPUT_HANDLE
mov pText, chr$("foo.itm foo.bar.itmbaz itm.txt plop.itm test.itm boo.itm")
mov ccText,len(pText)
mov pszPattern,chr$("\b[^\s]+\.itm\b")
; compile pattern
fnx compiled_pattern = pcre_compile, pszPattern, 0, &pszError, &error_pos, 0
.if !compiled_pattern
print "ERROR: "
print pszError,13,10
print "near position: "
add error_pos,1
print str$(error_pos),13,10
inkey
exit
.endif
; get number of capturing subpatterns from compiled pattern
fn pcre_fullinfo, compiled_pattern, 0, PCRE_INFO_CAPTURECOUNT, &nCapturingSubpatterns
.if eax != 0
print "error: fullinfo",13,10
invoke pcre_free, compiled_pattern
inkey
exit
.endif
; nOffsets = 3 * (nCapturingSubpatterns + 1)
mov eax,nCapturingSubpatterns
add eax,1
imul eax, 3
mov nOffsets,eax
; allocate array for offsets
imul ecx,eax,SIZEOF SDWORD
mov pOffsets,alloc(ecx)
.if !eax
print "Out of memory.",13,10
invoke pcre_free, compiled_pattern
inkey
exit
.endif
xor ebx,ebx
.while ebx < ccText
fnx nCapturedSubstrings = pcre_exec, compiled_pattern, 0, pText, ccText, ebx, 0, pOffsets, nOffsets
.if nCapturedSubstrings > 0
print chr$('"')
; get range of match
mov eax,pOffsets
mov edx,[eax][0*SDWORD] ; range begin
mov ecx,[eax][1*SDWORD] ; range end
; get range size
sub ecx,edx
; print match
add edx, pText
fn WriteFile, stdout, edx, ecx, &nWritten, NULL
print chr$('"'),13,10
.elseif nCapturedSubstrings == PCRE_ERROR_NOMATCH
; no match
.break
.else
; any error
print "error ...",13,10
.break
.endif
; continue search after last match
mov eax,pOffsets
mov ebx,[eax][1*SDWORD]
.endw
invoke pcre_free, compiled_pattern
free pOffsets
inkey
exit
main endp
end main
(important: the pattern does not allow empty matches)
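The reason for that remark is that the loop restarts at ovector[1]; after an empty match that offset equals the start offset, so the loop would never advance. A minimal sketch of the same loop shape in Python's `re` (not the PCRE API), with a simple guard added; `find_all` is a hypothetical helper, and `search(subject, start)` stands in for pcre_exec's start_offset argument:

```python
import re

def find_all(pattern, subject):
    """Collect non-overlapping matches by restarting after each match."""
    compiled = re.compile(pattern)
    matches = []
    start = 0
    while start <= len(subject):
        m = compiled.search(subject, start)  # start plays the role of start_offset
        if m is None:
            break
        matches.append(m.group(0))
        # m.end() corresponds to ovector[1]; step one character further
        # after an empty match so the loop always makes progress.
        start = m.end() if m.end() > m.start() else m.end() + 1
    return matches

print(find_all(r"\b[^\s]+\.itm\b",
               "foo.itm foo.bar.itmbaz itm.txt plop.itm test.itm boo.itm"))
```

The one-character step is a simplification for illustration; pcredemo.c treats the empty-match case more thoroughly.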
Thanks for that example qWord, it's helped a lot. I changed the test string to separate the names with a CR,LF pair and used PCRE_MULTILINE as the option, and it picks up the test names that match the "^.+\.itm$" pattern. I can thus separate the filenames with CR,LF when building the string/array in a bit of memory, and it should match them up based on that.
I appreciate your time and help, cheers.
Edit: just to clarify, in case someone comes across this and wants to know exactly what I meant:
szFileSpec DB "^.+\.itm$",0
szSubject DB 'foo.itm',13,10,' foo.bar.itmbaz',13,10,'itm.txt',13,10,'plop.itm',13,10,'test.itm',13,10,'boo.itm',13,10,0,0
lea eax, szSubject
mov pText, eax
mov ccText,len(pText)
lea eax, szFileSpec
mov pszPattern, eax
mov options, (PCRE_CASELESS + PCRE_MULTILINE + PCRE_NO_UTF8_CHECK)
; compile pattern
fnx compiled_pattern = pcre_compile, pszPattern, options, &pszError, &error_pos, 0
test result output:
"foo.itm"
"plop.itm"
"test.itm"
"boo.itm"
which matches the pattern supplied (similar to *.itm) and doesn't match the other two items, foo.bar.itmbaz and itm.txt, which is correct.
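One thing to watch with this approach is the newline convention. In PCRE, what "$" treats as a line end is a build-time default that can be overridden with the PCRE_NEWLINE_xxx compile options; the output above suggests this build accepts CR,LF, but that is an inference, and whether the .inc file defines those constants is not checked here. The sketch below uses Python's `re` (not PCRE), where "$" only recognises LF, to show how the same pattern behaves under the two conventions:

```python
import re

NAMES = ["foo.itm", " foo.bar.itmbaz", "itm.txt", "plop.itm", "test.itm", "boo.itm"]
FILESPEC = re.compile(r"^.+\.itm$", re.MULTILINE | re.IGNORECASE)

lf_subject = "\n".join(NAMES) + "\n"
crlf_subject = "\r\n".join(NAMES) + "\r\n"

lf_matches = FILESPEC.findall(lf_subject)
# Here "$" only looks for "\n", so the CR sits between ".itm" and the line end.
crlf_matches = FILESPEC.findall(crlf_subject)
# Dropping the CR first makes the result independent of the convention.
normalised = FILESPEC.findall(crlf_subject.replace("\r\n", "\n"))

print(lf_matches)
print(crlf_matches)
print(normalised)
```

If a different PCRE build returns no matches for a CR,LF separated list, separating the names with LF only is one way around it.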
Where could I find examples of regex patterns?
For example for links in HTML pages (#<a href=(.*?)>#), images... which are not case-sensitive.
For images I suppose: #<img src=(.*jpe?g)>#
The only place I have seen regex used for images is in the .htaccess file, using Apache mod_rewrite to prevent hotlinking or leeching of images:
http://www.cyberciti.biz/faq/apache-mod_rewrite-hot-linking-images-leeching-howto/
http://www.itechlounge.net/2012/01/web-prevent-image-hot-linking-to-your-site/
I've not seen it used directly in HTML files, unless it's with PHP scripting; PHP has its own match functions for regex.
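As a rough illustration of what such patterns could look like, here is a sketch in Python's `re`, whose syntax is close to PCRE's. The page fragment and both patterns are made up for the example, and they assume double-quoted attribute values; case-insensitivity comes from a flag (the counterpart of PCRE_CASELESS), and the capture group holds the URL.

```python
import re

# Hypothetical page fragment, only for this example.
HTML = ('<A HREF="http://example.com/page.html">link</A> '
        '<img class="c" SRC="pics/Photo.JPG" alt="">')

# re.IGNORECASE corresponds to PCRE_CASELESS.
LINK_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"', re.IGNORECASE)
IMAGE_RE = re.compile(r'<img\s[^>]*?src="([^"]*\.(?:gif|png|jpe?g))"', re.IGNORECASE)

links = LINK_RE.findall(HTML)
images = IMAGE_RE.findall(HTML)
print(links)
print(images)
```

Patterns like these are fragile against real-world HTML (single quotes, unquoted values, line breaks inside tags), so treat them as a starting point only.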
I have copied the code into this one :
The pattern is :
szImagePattern Byte "/(http:|data).+?\.(?:gif|png|jpe?g)[^",'"',"]*/ig",0
ProcessWebPage PROC USES EBX EDI ESI,__lpBuffer:LPBYTE,__dwFileSize:DWord
LOCAL _lpszError:LPBYTE
LOCAL _Error_Pos:DWord
LOCAL _nCapturingSubpatterns:DWord
LOCAL _nCapturedSubstrings:SDWORD
LOCAL _lpOffsets:PTR SDWORD
LOCAL _nOffsets:DWord
LOCAL _lpCompiled_Pattern:LPVOID
LOCAL _ccText:DWord
LOCAL _nWritten:DWord
DEBUG
INVOKE lstrlen,__lpBuffer
mov _ccText,eax
INVOKE pcre_compile,ADDR szImagePattern,PCRE_CASELESS+PCRE_MULTILINE+PCRE_UTF8,ADDR _lpszError,ADDR _Error_Pos,NULL
test eax,eax
jz @Finished
mov _lpCompiled_Pattern,eax
; get number of capturing subpatterns from compiled pattern
INVOKE pcre_fullinfo,_lpCompiled_Pattern,NULL,PCRE_INFO_CAPTURECOUNT,ADDR _nCapturingSubpatterns
test eax,eax ; pcre_fullinfo returns 0 on success, a negative error code otherwise
jne @NotFound
; _nOffsets = 3 * (_nCapturingSubpatterns + 1)
mov eax,_nCapturingSubpatterns
add eax,1
imul eax,3
mov _nOffsets,eax
; allocate array for offsets
imul ecx,eax,SIZEOF SDWORD
INVOKE Mem_Alloc,ecx
mov _lpOffsets,eax
.IF !eax
INVOKE pcre_free,_lpCompiled_Pattern
xor eax,eax
ret
.ENDIF
xor ebx,ebx
.WHILE ebx < _ccText
INVOKE pcre_exec,_lpCompiled_Pattern,NULL,__lpBuffer,_ccText,ebx,0,_lpOffsets,_nOffsets
mov _nCapturedSubstrings,eax
.IF _nCapturedSubstrings > 0
; get range of match
mov eax,_lpOffsets
mov edx,[eax][0*SDWORD] ; range begin
mov ecx,[eax][1*SDWORD] ; range end
; get range size
sub ecx,edx
; pass the match (pointer in edx, length in ecx) to SearchString
add edx,__lpBuffer
INVOKE SearchString,edx,ecx
.ELSE
.BREAK
.ENDIF
; continue search after last match
mov eax,_lpOffsets
mov ebx,[eax][1*SDWORD]
.ENDW
INVOKE Mem_Free,_lpOffsets
INVOKE pcre_free, _lpCompiled_Pattern
mov eax,TRUE
ret
@NotFound :
INVOKE pcre_free,_lpCompiled_Pattern
xor eax,eax
ret
; **********************************************************************************
ALIGN 4
; **********************************************************************************
@Finished :
ret
ProcessWebPage ENDP
I want to locate an image in an HTML page.
Before this function is called, the web page is downloaded into __lpBuffer.
I use PCRE81S.lib found in the old forum.
The function does not find any image, but I know there is at least one.
Quote
Extract from the web page: http://www.larevueautomobile.com/HD-Sexy-modele_Girls-and-Cars-vue_Exterieur-img_Sexy_Girls_and_Cars_028.jpg-image?utm_source=ImageLRA&utm_medium=HD&utm_term=Image-HD
<li><a href="http://moto.larevueautomobile.com" title="Moto"><img src="http://www.larevueautomobile.com/moto.png" class="menulien menulien2" alt="Moto icone" /><b>MOTO</b></a></li>
I need your help.
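One likely cause is the form of the pattern string: pcre_compile takes the bare pattern, so the leading "/" and the trailing "/ig" in szImagePattern are matched as literal characters. The "i" is already covered by PCRE_CASELESS and the "g" by the pcre_exec loop. The sketch below uses Python's `re` (not PCRE), where "/" is likewise an ordinary character, on a fragment based on the quoted extract; the TIGHT variant is a hypothetical adjustment, not something from the thread.

```python
import re

# <img> tag based on the quoted extract.
HTML = ('<a href="http://moto.larevueautomobile.com" title="Moto">'
        '<img src="http://www.larevueautomobile.com/moto.png" alt="Moto icone" /></a>')

# As posted: "/" and "/ig" are ordinary characters for the regex engine.
DELIMITED = r'/(http:|data).+?\.(?:gif|png|jpe?g)[^"]*/ig'
# Same pattern without delimiters; case-insensitivity comes from the flag.
BARE = r'(http:|data).+?\.(?:gif|png|jpe?g)[^"]*'
# Hypothetical tightening: stay inside one quoted attribute value.
TIGHT = r'(?:http:|data)[^"]+?\.(?:gif|png|jpe?g)[^"]*'

delimited_hit = re.search(DELIMITED, HTML, re.IGNORECASE)
bare_hit = re.search(BARE, HTML, re.IGNORECASE).group(0)
tight_hit = re.search(TIGHT, HTML, re.IGNORECASE).group(0)

print(delimited_hit)
print(bare_hit)
print(tight_hit)
```

Without the delimiters the pattern does match, but ".+?" can run from the href URL into the following src URL, which is why the tightened form restricts itself to characters other than the double quote.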
There are many images in that page:
include \masm32\MasmBasic\MasmBasic.inc ; download (http://masm32.com/board/index.php?topic=94.0)
Init
Let esi=FileRead$("http://www.larevueautomobile.com/HD-Sexy-modele_Girls-and-Cars-vue_Exterieur-img_Sexy_Girls_and_Cars_028.jpg-image?utm_source=ImageLRA&utm_medium=HD&utm_term=Image-HD")
xor ebx, ebx
.While 1
inc ebx
.Break .if !Extract$(esi, 'img src="', '"', xsLoop, 100)
PrintLine eax
.Endw
EndOfCode
Output:
http://www3.smartadserver.com/call/pubi/63121/490986/26008/S/[timestamp]/?
http://www3.smartadserver.com/call/pubi/63121/490986/26006/S/[timestamp]/?
http://www.larevueautomobile.com/logo/Nouveau-Logo-2014.jpg
http://www.larevueautomobile.com/accueil-LaRevueAutomobile.png
http://www.larevueautomobile.com/news.png
http://www.larevueautomobile.com/photo.png
http://www.larevueautomobile.com/fiche-technique.png
http://www.larevueautomobile.com/sport-auto.png
http://www.larevueautomobile.com/moto.png
http://www.larevueautomobile.com/lifestyle.png
/icone-2014/Magazine-Auto-Menu.png
http://www3.smartadserver.com/call/pubi/63121/490986/25988/S/[timestamp]/?
http://www.larevueautomobile.com/css-voiture/Plus-Photo.png
/images/Sexy/Girls-and-Cars/Sexy_Girls_and_Cars_028.jpg
http://www3.smartadserver.com/call/pubi/63121/490996/26762/M/[timestamp]/?
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy_Girls_a/Sexy_Girls_and_Cars_001.jpg
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy_Hot_ Ba/Sexy_Girls_and_Cars_029.jpg
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy_Hot_ Ba/Sexy_Girls_and_Cars_031.jpg
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy_Hot_ Ba/Sexy_Girls_and_Cars_007.jpg
/photo-voiture/photo-voiture.php?src=/images/Sexy/Girls-and-Cars/Sexy/Sexy_Hot_ Babes_Cars_197.jpg
http://www.larevueautomobile.com/css-voiture/Plus-Photo.png
/photo-voiture/photo-voiture.php?src=/images/Peugeot/208-Puretech110-Eat6/Exterieur/Peugeot_208_Puretech110_Eat6_010.jpg
/photo-voiture/photo-voiture.php?src=/images/Peugeot/Quartz/Exterieur/Peugeot_Quartz_005.jpg
/photo-voiture/photo-voiture.php?src=/images/Ford/Fiesta-Titanium-EcoBoost-2015/Exterieur/Ford_Fiesta_Titanium_EcoBoost_2015_012.jpg
/photo-voiture/photo-voiture.php?src=/images/Mitsubishi/Pajero-Long-Di-D-Instyle/Exterieur/Mitsubishi_Pajero_Long_Di_D_Instyle_028.jpg
/photo-voiture/photo-voiture.php?src=/images/Bmw/M5-E34-Evo/Exterieur/Bmw_M5_E34_Evo_008.jpg
/photo-voiture/photo-voiture.php?src=/images/KTM/1290-Super-Duke-R/Exterieur/KTM_1290_Super_Duke_R_003.jpg
http://www3.smartadserver.com/call/pubi/63121/490993/26762/M/[timestamp]/?
http://www3.smartadserver.com/call/pubi/63121/490986/25996/S/[timestamp]/?
What would I do if you were not here?
You are GOD
I will download MasmBasic asap.
Thanks a lot JJ
To read a file from the internet I do:
LoadWebPage PROC __hWnd:HWND,__lpszPageWeb:LPSTR
LOCAL _szWebFile[MAX_PATH]:Byte
LOCAL _dwSize:DWord
LOCAL _hUrlFile:HANDLE
LOCAL _hInternet:HANDLE
LOCAL _lpBuffer:LPBYTE
LOCAL _dwBytesRead:DWord
INVOKE VirtualAlloc,NULL,1024 * 1024 * 10,MEM_COMMIT+MEM_RESERVE,PAGE_READWRITE ; data buffer only, so execute access is not needed
test eax,eax
jz @Error_00
mov _lpBuffer,eax
INVOKE InternetOpen,NULL,PRE_CONFIG_INTERNET_ACCESS,NULL,INTERNET_INVALID_PORT_NUMBER,0
test eax,eax
jz @Error_01
mov _hInternet,eax
INVOKE InternetOpenUrl,_hInternet,__lpszPageWeb,NULL,0,INTERNET_FLAG_RELOAD,0
test eax,eax
jz @Error_02
mov _hUrlFile,eax
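; note: a single call may return only part of the page; to get everything, keep calling until it succeeds with _dwBytesRead = 0
INVOKE InternetReadFile,_hUrlFile,_lpBuffer,1024 * 1024 * 10,ADDR _dwBytesRead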
test eax,eax
jz @Error_03
INVOKE InternetCloseHandle,_hUrlFile
INVOKE InternetCloseHandle,_hInternet
INVOKE AnalyseFile,__hWnd,_lpBuffer
INVOKE VirtualFree,_lpBuffer,0,MEM_RELEASE
mov eax,TRUE
ret
@Error_03 :
INVOKE InternetCloseHandle,_hUrlFile
@Error_02 :
INVOKE InternetCloseHandle,_hInternet
@Error_01 :
INVOKE VirtualFree,_lpBuffer,0,MEM_RELEASE
@Error_00 :
xor eax,eax
@Exit :
ret
LoadWebPage ENDP
Is there another way to do this? I thought that a GET would be better.
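For what it's worth, InternetOpenUrl on an http URL already issues a GET, so the request side is fine. The part that usually needs attention is the reading: data arrives in pieces, so the read has to be repeated until one call reports zero bytes. A sketch of that loop shape in Python, with an in-memory stream standing in for the internet handle; `read_all` and the sample data are hypothetical:

```python
import io

def read_all(stream, chunk_size=4096):
    """Keep reading until a read returns no data, then join the pieces."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # zero bytes read: end of data
            break
        chunks.append(chunk)
    return b"".join(chunks)

# Stand-in for a downloaded page, read in 1 KB pieces.
page = read_all(io.BytesIO(b"<html>" + b"x" * 10000 + b"</html>"), chunk_size=1024)
print(len(page))
```

In the MASM version the same idea would mean advancing the buffer pointer by _dwBytesRead after each InternetReadFile call and stopping once it comes back as 0.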