Re: Extract tables from web pages

Started by NoCforMe, June 27, 2022, 04:34:41 PM

NoCforMe

Pretty kewl. Any chance I can get a version that I can point at any web page to extract from?

(BTW, my antivirus, AVG, complained about it, said they'd want to submit it to their lab for examination: should only take a day or two ...)
Assembly language programming should be fun. That's why I do it.

jj2007

Quote from: NoCforMe on June 27, 2022, 04:34:41 PM
Pretty kewl. Any chance I can get a version that I can point at any web page to extract from?

Here it is:

include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let edi=Clip$() ; get URL from clipboard
  Let edi=FileRead$(edi) ; get content
  FileWrite "test.tab", NoTag$(edi) ; strip tags from content, and write it to file
  ShEx "test.tab" ; launch the application associated with *.tab (often Excel)
EndOfCode


Quote
(BTW, my antivirus, AVG, complained about it, said they'd want to submit it to their lab for examination: should only take a day or two ...)

They'll examine it and conclude that, OMG!!!, it can download files from the Internet :cool:

hutch--

Guys,

I moved this as the Showcase is for finished projects, no discussion.

NoCforMe

Quote from: jj2007 on June 28, 2022, 01:09:24 AM
Quote from: NoCforMe on June 27, 2022, 04:34:41 PM
Pretty kewl. Any chance I can get a version that I can point at any web page to extract from?

Here it is:

include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let edi=Clip$() ; get URL from clipboard
  Let edi=FileRead$(edi) ; get content
  FileWrite "test.tab", NoTag$(edi) ; strip tags from content, and write it to file
  ShEx "test.tab" ; launch the application associated with *.tab (often Excel)
EndOfCode

So, for those of us who don't use MasmBasic**, what does FileWrite "test.tab", NoTag$(edi) do that we can code ourselves? By "strip tags", you mean HTML tags, right? I know how to do that. But how does it format the text in a way that Excel can separate it into cells? Or does it not do that, and Excel is smart enough to work it out?

** Yeah, yeah, I know what an amazing thing it is. Just not about to change my programming habits. Hell, I don't even like to use macros ...

NoCforMe

Quote from: hutch-- on June 28, 2022, 02:10:55 AM
Guys,

I moved this as the Showcase is for finished projects, no discussion.
So nobody is allowed to comment on any finished projects here? Interesting ...

jj2007

Quote from: NoCforMe on June 28, 2022, 04:47:02 AM
So, for those of us who don't use MasmBasic**, what does FileWrite "test.tab", NoTag$(edi) do that we can code ourselves? By "strip tags", you mean HTML tags, right? I know how to do that. But how does it format the text in a way that Excel can separate it into cells?

Well... if you insist on not using third-party libraries like MasmBasic (but what about the CRT, or the Masm32 lib?), then you are in for some really serious work:
- download the HTML file (->WinInet)
- open it in an editor
- find out how tables are coded
- find out what exactly you have to extract, and how.

Formatting the text is by far the simplest step: insert a tab character.
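For readers who want to see the steps above outside MasmBasic, here is a minimal sketch in Python. It is not how NoTag$() works internally; it assumes the page has already been downloaded into a string, and the names TableToTabs and html_table_to_tsv are made up for illustration. It collects the text of each <td>/<th> cell, joins the cells of a row with tab characters, and ends each row with a newline, which is the layout Excel splits into cells.

```python
from html.parser import HTMLParser


class TableToTabs(HTMLParser):
    """Hypothetical helper: gathers cell text per table row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []    # finished rows, each a list of cell strings
        self.row = None   # cells of the current <tr>, or None outside a row
        self.cell = None  # text pieces of the current cell, or None outside a cell

    def _close_cell(self):
        if self.cell is not None:
            # Collapse runs of whitespace so each cell stays on one line.
            self.row.append(" ".join("".join(self.cell).split()))
            self.cell = None

    def _close_row(self):
        if self.row is not None:
            self._close_cell()
            self.rows.append(self.row)
            self.row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()      # tolerate a missing </tr>
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self._close_cell()     # tolerate a missing </td>
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self.cell is not None:  # text outside table cells is ignored
            self.cell.append(data)


def html_table_to_tsv(html):
    """Return the table rows as tab-separated lines."""
    parser = TableToTabs()
    parser.feed(html)
    parser.close()
    return "".join("\t".join(row) + "\n" for row in parser.rows)
```

Writing the returned string to a file such as test.tab gives one line per table row with one tab between cells. Nested tables and colspan/rowspan are not handled in this sketch.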

NoCforMe

So stripping out HTML I get. But inserting tab characters: where? Between HTML tags, like so?

<td> ... table text ... </td> [TAB] <td> ... more table text ... </td>


And how does your library work? I would think that FileWrite() just writes a file as-is, yes? What exactly does the NoTag$ modifier do?

Hmm; much more of this and I guess I'll have to actually try out your "product". But here's the thing: what you've done here kind of takes all the fun out of it for me, since your method is so "canned": feed it a web page and it spits out a text stream that Excel can make into a spreadsheet. I like to know what goes on inside that black box, or preferably code it myself.

Hell, otherwise I'd just go to one of those JavaScript web pages that reformats stuff for you ...

jj2007

Quote from: NoCforMe on June 28, 2022, 05:28:05 AM
And how does your library work? I would think that FileWrite() just writes a file as-is, yes? What exactly does the NoTag$ modifier do?

https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1077

NoCforMe

OK, but that doesn't say anything about putting in tabs:
Quote
strips HTML tags, scripts and styles; don't expect miracles - reducing a perfectly styled webpage to pure text will not look pretty, but it's handy to filter webpages by text content
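To make the quoted description concrete, here is a rough sketch in Python of "strip HTML tags, scripts and styles". It only illustrates the behaviour described in the quick reference, not the actual NoTag$() code; the names TextOnly and strip_tags are invented for this example. Note that it inserts no tabs at all, which matches the observation above.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical helper: keeps text, drops tags, scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []      # text fragments that survive
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the visible text of an HTML string."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

For example, strip_tags("<p>Hello <b>world</b></p>") yields the plain text of the paragraph with the markup removed.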

jj2007

Believe me, inserting the tabs is by far the most trivial part of the exercise. Check the lodsb and stosb instructions.
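As a rough picture of what such a loop could look like, here is a sketch in Python that mimics the load-a-byte, store-a-byte pattern of lodsb/stosb. It is an assumption about one possible approach, not the code inside MasmBasic, and copy_with_tabs is a made-up name. Ordinary bytes are copied through; a closing </td> or </th> becomes a tab, a closing </tr> becomes a newline, and every other tag is dropped.

```python
TAB = 0x09
NEWLINE = 0x0A


def copy_with_tabs(src: bytes) -> bytes:
    """Copy src to a new buffer, turning cell and row ends into tab and newline."""
    dst = bytearray()
    i = 0
    n = len(src)
    while i < n:
        byte = src[i]          # like lodsb: load one byte, advance the source index
        i += 1
        if byte != ord("<"):
            dst.append(byte)   # like stosb: store the byte in the destination
            continue
        end = src.find(b">", i)
        if end < 0:
            break              # unterminated tag: stop copying
        tag = src[i:end].strip().lower()
        i = end + 1            # skip the whole tag
        if tag in (b"/td", b"/th"):
            dst.append(TAB)    # end of a cell: emit a tab
        elif tag == b"/tr":
            # End of a row: turn the tab after the last cell into a newline.
            if dst and dst[-1] == TAB:
                dst[-1] = NEWLINE
            else:
                dst.append(NEWLINE)
    return bytes(dst)
```

So for the row asked about earlier, the tab lands where each cell closes. Whitespace between tags in the source is copied through unchanged in this sketch, so real pages would need extra trimming.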

NoCforMe

But but but ... there's no LODSB nor STOSB in that code you posted. Where are they?

jj2007

You want to roll your own, so do it. Start with the NoTag$() stuff, it's only about 320 lines of assembly code.

hutch--

> So nobody is allowed to comment on any finished projects here? Interesting ...

No, it means the subforum is used for what it was designed for: a showcase for finished projects that should not be subject to graffiti. The Workshop is where analysis and other comments are encouraged. I honestly get tired of having to clean up sloppy posts plastered all over the place. The reason you can find anything is that I clean it up on a regular basis.

NoCforMe

Quote from: jj2007 on June 28, 2022, 09:44:25 AM
You want to roll your own, so do it. Start with the NoTag$() stuff, it's only about 320 lines of assembly code.

OK, so where is the source? I'm not familiar with how your vast information repositories are arranged.

jj2007

NoTag$() is in \Masm32\MasmBasic\MasmBasic.inc