Getting html content of google search

Started by Siekmanski, December 01, 2013, 01:45:23 AM

jj2007

Quote from: Siekmanski on December 04, 2013, 04:09:43 AM
Edit: better to read .asm file in new attachment.

Hmmm... hidden payload?  ;)

Output:

Internet server

Satus Code: 200 OK

CONTENT_LENGTH: 527516

http://weirdestband.files.wordpress.com/2011/11/rammstein.jpg

Saving Rammstein.jpg ....


It takes a while, though... ca. 30 seconds or so.

dedndave

works here, Marinus   :t
XP SP3
maybe 20 seconds - didn't time it - lol
but, it's a big image

don't know where Jochen got 527516
the one i got was a little over 4 MB

Siekmanski

Quote
Hmmm... hidden payload?  ;)

No, just forgot to change tabs to spaces to make the source code more readable.  :biggrin:

30 seconds, that's a long long time.
I'll rewrite the code to search and load 1024 bytes at a time; that should speed things up.
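
A minimal Python sketch of the "1024 bytes at a time" idea, not the MASM code from the attachment. The function name copy_in_chunks and the file-like src/dst objects are assumptions made for illustration; the real program would presumably read from an internet handle instead.

```python
import io

def copy_in_chunks(src, dst, chunk_size=1024):
    """Copy everything from src to dst, reading chunk_size bytes per call.

    src and dst are any file-like objects (hypothetical stand-ins for the
    download handle and the output file). Returns the number of bytes copied.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)   # at most chunk_size bytes
        if not chunk:                  # empty read means end of data
            break
        dst.write(chunk)
        total += len(chunk)
    return total

if __name__ == "__main__":
    source = io.BytesIO(b"x" * 2500)
    target = io.BytesIO()
    print(copy_in_chunks(source, target))
```

Reading a block per call instead of a byte per call cuts the number of read calls by roughly the chunk size, which is usually where the time goes.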
Creative coders use backward thinking techniques as a strategy.

GoneFishing

Win8 32 bit : works OK
Quote
Internet server

Satus Code: 200 OK


CONTENT_LENGTH: 526874

http://weirdestband.files.wordpress.com/2011/11/rammstein.jpg

Saving Rammstein.jpg ....


Press any key to continue...


The image is the same as in Jochen's post but the CONTENT_LENGTH DIFFERS

Siekmanski

Thanks guys  :biggrin:

4 MB, that's also part of the long time I guess, but you can search for smaller images if you like.

When you search for images (tbm=isch), you can also use the following tbs values:

•Large images: tbs=isz:l
•Medium images: tbs=isz:m
•Icon sized images: tbs=isz:i
•Image size larger than 400×300: tbs=isz:lt,islt:qsvga
•Image size larger than 640×480: tbs=isz:lt,islt:vga
•Image size larger than 800×600: tbs=isz:lt,islt:svga
•Image size larger than 1024×768: tbs=isz:lt,islt:xga
•Image size larger than 1600×1200: tbs=isz:lt,islt:2mp
•Image size larger than 2272×1704: tbs=isz:lt,islt:4mp
•Image sized exactly 1000×1000: tbs=isz:ex,iszw:1000,iszh:1000
•Images in full color: tbs=ic:color
•Images in black and white: tbs=ic:gray
•Images that are red: tbs=ic:specific,isc:red [orange, yellow, green, teal, blue, purple, pink, white, gray, black, brown]
•Image type Face: tbs=itp:face
•Image type Photo: tbs=itp:photo
•Image type Clipart: tbs=itp:clipart
•Image type Line drawing: tbs=itp:lineart
•Group images by subject: tbs=isg:to
•Show image sizes in search results: tbs=imgo:1

Example URL: Search in images for "michael jackson" as a phrase, and limit results to 4 megapixel images or larger, color images, face images, and group the results by topic:

http://www.google.com/search?q=%22michael+jackson%22&tbm=isch&tbs=ic:color,isz:lt,islt:4mp,itp:face,isg:to
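
As a rough illustration of how such a URL is put together, here is a small Python sketch (not taken from the attachment). The function name build_image_search_url is hypothetical; the parameter names q, tbm and tbs are the ones shown in the list above.

```python
from urllib.parse import urlencode

def build_image_search_url(query, tbs_values):
    """Build a Google image search URL from a query and a list of tbs values.

    The tbs values are joined with commas, as in the example URL above.
    """
    params = {
        "q": query,
        "tbm": "isch",                 # image search
        "tbs": ",".join(tbs_values),
    }
    # Keep ':' and ',' unescaped so the tbs part stays readable.
    return "http://www.google.com/search?" + urlencode(params, safe=":,")

if __name__ == "__main__":
    print(build_image_search_url(
        '"michael jackson"',
        ["ic:color", "isz:lt", "islt:4mp", "itp:face", "isg:to"],
    ))
```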

Siekmanski

Quote
The image is the same as in Jochen's post but the CONTENT_LENGTH DIFFERS

The content differs from time to time; maybe it's updated with new info?

dedndave

well - don't know how you select which image to d/l
i didn't look at the code

but - google selects results based on location and past search history
i may get the same images as Jochen, but in a different order

dedndave

try creating an HTML page from the first 100 available images
<a> tags are pretty easy

i.e., rather than downloading,
just see what's available to help understand the selection issues
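
A possible sketch of that suggestion in Python (the function name make_link_page is made up for this example, and the default limit of 100 comes from the "first 100 available images" above):

```python
import html

def make_link_page(urls, limit=100):
    """Return an HTML page with one <a> tag per URL, for the first `limit` URLs."""
    links = []
    for url in urls[:limit]:
        safe = html.escape(url, quote=True)   # escape &, <, > and quotes
        links.append('<a href="{0}">{0}</a><br>'.format(safe))
    return "<html><body>\n" + "\n".join(links) + "\n</body></html>\n"

if __name__ == "__main__":
    print(make_link_page(["http://example.com/a.jpg", "http://example.com/b.jpg"]))
```

Opening the resulting page in a browser shows which images were found and in which order, without downloading any of them.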

Siekmanski

Hi Dave,

CONTENT_LENGTH: is the total length of the html file from google.

It finds the urls by searching for imgurl=http: in the html file and checking whether each one ends with .jpg
The image in my source code is the first one it finds in the html file, but there are many more in there.
Downloading the image was purely for checking if the found image-url works.
Next I'll code a routine that gathers all the urls and puts them in a list from where I can choose one.
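
The search described above could look roughly like this in Python. This is only a sketch, not the MASM routine from the attachment: the name find_jpg_urls is hypothetical, and it assumes each url runs from imgurl= up to the next "&", quote or whitespace in the html.

```python
import re

# Assumption: the url after "imgurl=" ends at the next '&', quote,
# angle bracket or whitespace character.
IMGURL_PATTERN = re.compile(r"imgurl=(http:[^&\"'<>\s]+)")

def find_jpg_urls(html_text):
    """Return every imgurl=http: value in html_text that ends with .jpg."""
    return [url for url in IMGURL_PATTERN.findall(html_text)
            if url.endswith(".jpg")]

if __name__ == "__main__":
    sample = ('<a href="/imgres?imgurl=http://example.com/a.jpg&amp;x=1">'
              '<a href="/imgres?imgurl=http://example.com/b.png&amp;x=2">')
    print(find_jpg_urls(sample))
```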

Siekmanski

Now it finds all jpg urls and checks the length of each url ( no longer than 259 bytes + trailing 0 )
and puts them in an image list with all the addresses to the url strings.
Some of the strings look like this:

http://www.supermusic.sk/obrazky/2585635_P%252520R%252520Brown%2525202011.jpg

I'll work on a routine to convert those to plain ascii text.
At the bottom of the source is a routine to save one of the images found by image number. ( remove semicolons )

edit: new attachment, added maximum of 128 images to prevent buffer overflow and removed 2 lines of unused code.
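
The two limits mentioned in this post (259 bytes per url, 128 images in the list) can be sketched in Python as follows. The name build_image_list is invented for the example, and skipping over-long urls rather than stopping at them is an assumption; the attachment may handle that case differently.

```python
MAX_URL_BYTES = 259   # url length limit from the post (+ trailing 0 in the asm version)
MAX_IMAGES = 128      # cap on the image list from the post

def build_image_list(urls, max_len=MAX_URL_BYTES, max_images=MAX_IMAGES):
    """Collect urls into a list, enforcing the length limit and the list cap."""
    image_list = []
    for url in urls:
        if len(url.encode("utf-8")) > max_len:
            continue                      # assumption: over-long urls are skipped
        image_list.append(url)
        if len(image_list) == max_images:
            break                         # list is full, stop collecting
    return image_list

if __name__ == "__main__":
    found = ["http://example.com/%d.jpg" % i for i in range(200)]
    print(len(build_image_list(found)))
```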

dedndave

when you write the conversion routine, you might want to support something like the following
%2520
that's a tricky one, because "%25" is "%"   ;)
so, "%2520" is a space - normally, you'd see it as "%20"
i have seen that in URLs before

Siekmanski

Url decoding routine done.  :biggrin:

% == 25 hex
example:
%252520 is encoded 3 times and represents a space character ( 20 hex == 32 dec == space )
%2520 is a space encoded 2 times
%20 is a space encoded 1 time

Decoding routine checks for multiple % and then calculates the value that follows.

http://i1223.photobucket.com/albums/dd517/jgwicked/Rammstein%252520Dec%25252011%2525202010/Rammstein1992.jpg
decoded: http://i1223.photobucket.com/albums/dd517/jgwicked/Rammstein Dec 11 2010/Rammstein1992.jpg
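
For comparison, the same multiple-decoding idea can be sketched in Python by percent-decoding repeatedly until the string stops changing. This is a different technique from the routine described above (which counts the leading % signs), and the name decode_repeatedly and the max_rounds safety limit are assumptions for this example.

```python
from urllib.parse import unquote

def decode_repeatedly(url, max_rounds=10):
    """Percent-decode url again and again until nothing changes anymore."""
    for _ in range(max_rounds):
        decoded = unquote(url)
        if decoded == url:      # no escapes left
            break
        url = decoded
    return url

if __name__ == "__main__":
    print(decode_repeatedly(
        "http://i1223.photobucket.com/albums/dd517/jgwicked/"
        "Rammstein%252520Dec%25252011%2525202010/Rammstein1992.jpg"))
```

One caveat: a url that really contains a literal % followed by two hex digits would be decoded one step too far by this approach.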

Siekmanski

found an error in line 147

    cmp     edx,260
    jz      url_to_long

change it to:

    cmp     edx,260
    je      url_to_long