Getting html content of google search

jj2007 · December 04, 2013, 04:37:05 AM

Quote from: Siekmanski on December 04, 2013, 04:09:43 AM
Edit: better to read .asm file in new attachment.

Hmmm... hidden payload? ;)

Output:

Internet server

Satus Code: 200 OK

CONTENT_LENGTH: 527516

http://weirdestband.files.wordpress.com/2011/11/rammstein.jpg

Saving Rammstein.jpg ....

It takes a while, though... ca. 30 seconds or so.

dedndave · December 04, 2013, 04:41:28 AM

works here, Marinus :t
XP SP3
maybe 20 seconds - didn't time it - lol
but, it's a big image

don't know where Jochen got 527516
the one i got was a little over 4 MB

Siekmanski · December 04, 2013, 04:48:14 AM

Quote
Hmmm... hidden payload? ;)

No, just forgot to change tabs to spaces to make the source code more readable.

30 seconds, that's a long, long time.
I'll rewrite the code to load and search the data 1024 bytes at a time; that should speed things up.
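A rough illustration of the chunked idea, in Python rather than MASM so it stays short. This is only a sketch under assumptions: the post does not say how matches that straddle a chunk boundary are handled, so the carry-over of the last few bytes is a guess at one workable approach, and find_in_stream is a hypothetical helper name.

```python
import io

def find_in_stream(stream, marker, chunk_size=1024):
    """Return the absolute offset of the first `marker` in `stream`, or -1.

    Reads `chunk_size` bytes at a time and carries over the last
    len(marker) - 1 bytes so a match split across two chunks is still found.
    """
    tail = b""
    consumed = 0  # number of stream bytes that come before `tail`
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        pos = window.find(marker)
        if pos != -1:
            return consumed + pos
        keep = len(marker) - 1
        tail = window[-keep:] if keep else b""
        consumed += len(window) - len(tail)

# The marker starts at offset 1020, so it straddles the first 1024-byte chunk.
demo = io.BytesIO(b"x" * 1020 + b"imgurl=http://example.com/a.jpg")
offset = find_in_stream(demo, b"imgurl=http:")
```

Any object with a read(n) method works as the stream, so the same loop could sit on top of a network response.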

GoneFishing · December 04, 2013, 04:49:12 AM

Win8 32 bit : works OK

Quote
Internet server

Satus Code: 200 OK

CONTENT_LENGTH: 526874

http://weirdestband.files.wordpress.com/2011/11/rammstein.jpg

Saving Rammstein.jpg ....

Press any key to continue...

The image is the same as in Jochen's post but the CONTENT_LENGTH DIFFERS

Siekmanski · December 04, 2013, 04:57:08 AM

Thanks guys

4 MB, that's also part of the long time I guess, but you can search for smaller images if you like.

(TBM=isch)

When you search for images (tbm=isch), you can also use the following tbs values:

•Large images: tbs=isz:l
•Medium images: tbs=isz:m
•Icon sized images: tbs=isz:i
•Image size larger than 400×300: tbs=isz:lt,islt:qsvga
•Image size larger than 640×480: tbs=isz:lt,islt:vga
•Image size larger than 800×600: tbs=isz:lt,islt:svga
•Image size larger than 1024×768: tbs=isz:lt,islt:xga
•Image size larger than 1600×1200: tbs=isz:lt,islt:2mp
•Image size larger than 2272×1704: tbs=isz:lt,islt:4mp
•Image sized exactly 1000×1000: tbs=isz:ex,iszw:1000,iszh:1000
•Images in full color: tbs=ic:color
•Images in black and white: tbs=ic:gray
•Images that are red: tbs=ic:specific,isc:red [orange, yellow, green, teal, blue, purple, pink, white, gray, black, brown]
•Image type Face: tbs=itp:face
•Image type Photo: tbs=itp:photo
•Image type Clipart: tbs=itp:clipart
•Image type Line drawing: tbs=itp:lineart
•Group images by subject: tbs=isg:to
•Show image sizes in search results: tbs=imgo:1

Example URL: Search in images for "michael jackson" as a phrase, and limit results to 4 megapixel images or larger, color images, face images, and group the results by topic:

http://www.google.com/search?q=%22michael+jackson%22&tbm=isch&tbs=ic:color,isz:lt,islt:4mp,itp:face,isg:to
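For anyone who wants to assemble such URLs programmatically, here is a minimal Python sketch. The function name build_image_search_url is made up for illustration, and whether Google still honours every tbs value listed above is not something this checks. The safe=":," argument only keeps the colons and commas readable, as in the example URL.

```python
from urllib.parse import urlencode

def build_image_search_url(query, tbs_filters):
    """Build a Google image search URL from a query and a list of tbs values."""
    params = {
        "q": query,
        "tbm": "isch",                # image search
        "tbs": ",".join(tbs_filters)  # filters are comma-separated
    }
    # safe=":," leaves ':' and ',' unescaped inside the tbs value
    return "http://www.google.com/search?" + urlencode(params, safe=":,")

url = build_image_search_url(
    '"michael jackson"',
    ["ic:color", "isz:lt", "islt:4mp", "itp:face", "isg:to"],
)
```

With these inputs the result should be the same string as the example URL above.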

Siekmanski · December 04, 2013, 05:00:39 AM

Quote
The image is the same as in Jochen's post but the CONTENT_LENGTH DIFFERS

The content differs from time to time; maybe it's updated with new info in between?

dedndave · December 04, 2013, 05:07:19 AM

well - don't know how you select which image to d/l
i didn't look at the code

but - google selects results based on location and past search history
i may get the same images as Jochen, but in a different order

dedndave · December 04, 2013, 05:11:44 AM

try creating an HTML page from the first 100 available images
<a> tags are pretty easy

i.e., rather than downloading,
just see what's available to help understand the selection issues
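A small Python sketch of that idea: turn a list of found URLs into a page of links instead of downloading anything. The helper name build_link_page and the list markup are illustrative choices, not something from the attachment.

```python
from html import escape

def build_link_page(urls, limit=100):
    """Return an HTML page that links to the first `limit` URLs."""
    items = [
        '<li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in urls[:limit]
    ]
    return "<html><body><ol>\n" + "\n".join(items) + "\n</ol></body></html>\n"

page = build_link_page([
    "http://example.com/a.jpg",          # placeholder URLs
    "http://example.com/b.jpg?x=1&y=2",
])
```

Writing page to a file and opening it in a browser shows the order in which the results arrived.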

Siekmanski · December 04, 2013, 07:54:59 AM

Hi Dave,

CONTENT_LENGTH: is the total length of the html file from google.

The code finds the urls by searching for imgurl=http: in the html file and checking whether each one ends with .jpg
The image in my source code is the first one it finds in the html file, but there are many more in there.
Downloading the image was purely to check that the found image url works.
Next I'll code a routine that gathers all the urls and puts them in a list from which I can choose one.

dedndave · December 04, 2013, 08:26:54 AM

oh - gotcha :t

Siekmanski · December 04, 2013, 10:26:23 AM

Now it finds all jpg urls and checks the length of each url ( no longer than 259 bytes + trailing 0 )
and puts them in an image list with all the addresses of the url strings.
Some of the strings look like this:

http://www.supermusic.sk/obrazky/2585635_P%252520R%252520Brown%2525202011.jpg

I'll work on a routine to convert those to plain ascii text.
At the bottom of the source is a routine to save one of the found images by image number. ( remove the semicolons to enable it )

edit: new attachment, added maximum of 128 images to prevent buffer overflow and removed 2 lines of unused code.
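The gathering step described in the last two posts can be sketched in a few lines of Python. This is not the attached MASM source. The assumption that each url ends at the next "&" is a guess about the html layout, and collect_jpg_urls is a hypothetical name. The limits of 259 characters and 128 entries mirror the numbers mentioned above.

```python
def collect_jpg_urls(html, marker="imgurl=", max_urls=128, max_len=259):
    """Collect .jpg urls that follow `marker` in `html`, within the given limits."""
    urls = []
    pos = 0
    while len(urls) < max_urls:
        start = html.find(marker + "http:", pos)
        if start == -1:
            break
        start += len(marker)
        end = html.find("&", start)  # assumption: the url ends at the next "&"
        if end == -1:
            end = len(html)
        url = html[start:end]
        # keep only .jpg urls that fit the fixed-size buffer (259 + trailing 0)
        if url.lower().endswith(".jpg") and len(url) <= max_len:
            urls.append(url)
        pos = end
    return urls

sample = (
    "x imgurl=http://a.example/one.jpg&amp;imgrefurl=http://a.example/ "
    "imgurl=http://a.example/two.png&amp; "
    "imgurl=http://a.example/" + "z" * 300 + ".jpg&amp; "
    "imgurl=http://b.example/three.jpg&amp;"
)
found = collect_jpg_urls(sample)
```

In the sample, the .png entry and the overlong url are skipped and the two remaining .jpg urls are kept in order.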

dedndave · December 04, 2013, 11:04:53 AM

when you write the conversion routine, you might want to support something like the following

%2520
that's a tricky one, because "%25" is "%" ;)
so, "%2520" is a space - normally, you'd see it as "%20"
i have seen that in URLs before

jj2007 · December 04, 2013, 01:16:05 PM

Ever heard of/used OpenSSL?

Siekmanski · December 04, 2013, 01:54:16 PM

Url decoding routine done.

% == 25 hex
example:
%252520 is encoded 3 times and represents a space character ( 20 hex == 32 dec == space )
%2520 is a space encoded 2 times
%20 is a space encoded 1 time

The decoding routine checks for repeated encodings of % ( %25 ) and then calculates the value of the hex digits that follow.

http://i1223.photobucket.com/albums/dd517/jgwicked/Rammstein%252520Dec%25252011%2525202010/Rammstein1992.jpg
decoded: http://i1223.photobucket.com/albums/dd517/jgwicked/Rammstein Dec 11 2010/Rammstein1992.jpg
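A minimal Python sketch of that decoding idea, based only on the description above and not on the attached routine. decode_multi_encoded and is_hex_pair are hypothetical names, and the sketch assumes plain ascii values as in the examples.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")

def is_hex_pair(text, pos):
    """True if text[pos:pos+2] consists of exactly two hex digits."""
    return pos + 2 <= len(text) and all(ch in HEX_DIGITS for ch in text[pos:pos + 2])

def decode_multi_encoded(url):
    """Decode %XX sequences, collapsing repeated %25 layers in one pass."""
    out = []
    i = 0
    while i < len(url):
        if url[i] == "%" and is_hex_pair(url, i + 1):
            j = i + 1
            # skip every "25" that is itself followed by another hex pair
            while url[j:j + 2] == "25" and is_hex_pair(url, j + 2):
                j += 2
            out.append(chr(int(url[j:j + 2], 16)))
            i = j + 2
        else:
            out.append(url[i])
            i += 1
    return "".join(out)

decoded = decode_multi_encoded(
    "http://i1223.photobucket.com/albums/dd517/jgwicked/"
    "Rammstein%252520Dec%25252011%2525202010/Rammstein1992.jpg"
)
```

One caveat of collapsing layers this eagerly: a literal percent sign that is followed by two hex digits in the real file name would be decoded one level too far. Decoding one layer at a time and testing the url after each pass would avoid that, at the cost of extra requests.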

Siekmanski · December 04, 2013, 02:33:24 PM

found an error in line 147

cmp edx,260
jz url_to_long

change it to:

cmp edx,260
je url_to_long

The MASM Forum
