The MASM Forum

General => The Workshop => Topic started by: NoCforMe on June 27, 2022, 04:34:41 PM

Title: Re: Extract tables from web pages
Post by: NoCforMe on June 27, 2022, 04:34:41 PM
Pretty kewl. Any chance I can get a version that I can point at any web page to extract from?

(BTW, my antivirus, AVG, complained about it, said they'd want to submit it to their lab for examination: should only take a day or two ...)
Title: Re: Re: Extract tables from web pages
Post by: jj2007 on June 28, 2022, 01:09:24 AM
Quote from: NoCforMe on June 27, 2022, 04:34:41 PM
Pretty kewl. Any chance I can get a version that I can point at any web page to extract from?

Here it is:

include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let edi=Clip$() ; get URL from clipboard
  Let edi=FileRead$(edi) ; get content
  FileWrite "test.tab", NoTag$(edi) ; strip tags from content, and write it to file
  ShEx "test.tab" ; launch the application associated with *.tab (often Excel)
EndOfCode


Quote
(BTW, my antivirus, AVG, complained about it, said they'd want to submit it to their lab for examination: should only take a day or two ...)

They'll examine it and conclude that, OMG!!!, it can download files from the Internet :cool:
Title: Re: Extract tables from web pages
Post by: hutch-- on June 28, 2022, 02:10:55 AM
Guys,

I moved this as the Showcase is for finished projects, no discussion.
Title: Re: Re: Extract tables from web pages
Post by: NoCforMe on June 28, 2022, 04:47:02 AM
Quote from: jj2007 on June 28, 2022, 01:09:24 AM
Quote from: NoCforMe on June 27, 2022, 04:34:41 PM
Pretty kewl. Any chance I can get a version that I can point at any web page to extract from?

Here it is:

include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let edi=Clip$() ; get URL from clipboard
  Let edi=FileRead$(edi) ; get content
  FileWrite "test.tab", NoTag$(edi) ; strip tags from content, and write it to file
  ShEx "test.tab" ; launch the application associated with *.tab (often Excel)
EndOfCode

So, for those of us who don't use MasmBasic**, what does FileWrite "test.tab", NoTag$(edi) do that we can code ourselves? By "strip tags", you mean HTML tags, right? I know how to do that. But how does it format the text in a way that Excel can separate it onto cells? Or does it not do that, and Excel is smart enough to work it out by itself?

** Yeah, yeah, I know what an amazing thing it is. Just not about to change my programming habits. Hell, I don't even like to use macros ...
Title: Re: Extract tables from web pages
Post by: NoCforMe on June 28, 2022, 04:51:45 AM
Quote from: hutch-- on June 28, 2022, 02:10:55 AM
Guys,

I moved this as the Showcase is for finished projects, no discussion.
So nobody is allowed to comment on any finished projects here? Interesting ...
Title: Re: Re: Extract tables from web pages
Post by: jj2007 on June 28, 2022, 05:07:59 AM
Quote from: NoCforMe on June 28, 2022, 04:47:02 AM
So, for those of us who don't use MasmBasic**, what does FileWrite "test.tab", NoTag$(edi) do that we can code ourselves? By "strip tags", you mean HTML tags, right? I know how to do that. But how does it format the text in a way that Excel can separate it onto cells?

Well... if you insist on not using third-party libraries like MasmBasic (but what about the CRT, or the Masm32 lib?), then you are in for some really serious work:
- download the HTML file (->WinInet)
- open it in an editor
- find out how tables are coded
- find out what exactly you have to extract, and how.

Formatting the text is by far the simplest step: insert a tab character.
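
For the first step in the list above, a WinInet download in plain MASM32 might look roughly like the sketch below. It is not how MasmBasic's FileRead$ is implemented; the agent string, the placeholder URL, the fixed 64 KB buffer and all label names are assumptions made for illustration, and it assumes the MASM32 SDK is installed in \masm32 with its usual wininet.inc and wininet.lib.

```asm
include \masm32\include\masm32rt.inc
include \masm32\include\wininet.inc
includelib \masm32\lib\wininet.lib

BUFSIZE equ 65536                        ; assumed fixed buffer size (64 KB)

.data
agent   db "TableGrab", 0                ; hypothetical user-agent string
url     db "http://example.com/", 0      ; placeholder URL
.data?
hNet    dd ?
hUrl    dd ?
nRead   dd ?
pageBuf db BUFSIZE dup(?)

.code
start:
    ; first 0 = INTERNET_OPEN_TYPE_PRECONFIG (use the system proxy settings)
    invoke InternetOpenA, offset agent, 0, NULL, NULL, 0
    test eax, eax
    jz Quit
    mov hNet, eax
    invoke InternetOpenUrlA, hNet, offset url, NULL, 0, 0, 0
    test eax, eax
    jz CloseNet
    mov hUrl, eax
    mov edi, offset pageBuf              ; edi = write position (survives API calls)
  ReadMore:
    mov ecx, offset pageBuf + BUFSIZE - 1
    sub ecx, edi                         ; ecx = room left, one byte kept for the zero
    jz CloseUrl                          ; buffer full: content is truncated
    invoke InternetReadFile, hUrl, edi, ecx, offset nRead
    test eax, eax
    jz CloseUrl                          ; read error
    mov eax, nRead
    test eax, eax
    jz CloseUrl                          ; zero bytes read = end of data
    add edi, eax
    jmp ReadMore
  CloseUrl:
    mov byte ptr [edi], 0                ; zero-terminate what was received
    invoke InternetCloseHandle, hUrl
    invoke StdOut, offset pageBuf        ; show the raw HTML on the console
  CloseNet:
    invoke InternetCloseHandle, hNet
  Quit:
    invoke ExitProcess, 0
end start
```

Built as a console application, this only fetches and prints the raw HTML; a real tool would size the buffer dynamically and report errors instead of exiting silently.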
Title: Re: Extract tables from web pages
Post by: NoCforMe on June 28, 2022, 05:28:05 AM
So stripping out HTML I get. But inserting tab characters: where? Between HTML tags, like so?

<td> ... table text ... </td> [TAB] <td> ... more table text ... </td>


And how does your library work? I would think that FileWrite() just writes a file as-is, yes? What exactly does the NoTag$ modifier do?

Hmm; much more of this and I guess I'll have to actually try out your "product". But here's the thing: what you've done here kind of takes all the fun out of it for me, since your method is so "canned": feed it a web page and it spits out a text stream that Excel can make into a spreadsheet. I like to know what goes on inside that black box, or preferably code it myself.

Hell, otherwise I'd just go to one of those Javascript web pages that reformats stuff for you ...
Title: Re: Extract tables from web pages
Post by: jj2007 on June 28, 2022, 05:37:14 AM
Quote from: NoCforMe on June 28, 2022, 05:28:05 AM
And how does your library work? I would think that FileWrite() just writes a file as-is, yes? What exactly does the NoTag$ modifier do?

https://www.jj2007.eu/MasmBasicQuickReference.htm#Mb1077
Title: Re: Extract tables from web pages
Post by: NoCforMe on June 28, 2022, 05:41:04 AM
OK, but that doesn't say anything about putting in tabs:
Quote
strips HTML tags, scripts and styles; don't expect miracles - reducing a perfectly styled webpage to pure text will not look pretty, but it's handy to filter webpages by text content
Title: Re: Extract tables from web pages
Post by: jj2007 on June 28, 2022, 06:15:16 AM
Believe me, inserting the tabs is by far the most trivial part of the exercise. Check the lodsb and stosb instructions.
Title: Re: Extract tables from web pages
Post by: NoCforMe on June 28, 2022, 07:26:27 AM
But but but ... there's no LODSB nor STOSB in that code you posted. Where are they?
Title: Re: Extract tables from web pages
Post by: jj2007 on June 28, 2022, 09:44:25 AM
You want to roll your own, so do it. Start with the NoTag$() stuff, it's only about 320 lines of assembly code.
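
As a possible starting point for a roll-your-own version, here is a minimal sketch of the lodsb/stosb idea mentioned earlier in the thread. It is not the NoTag$() source, and it does far less: it copies text, drops everything between < and >, writes a tab where a </td> or </th> was, and a CR/LF where a </tr> was. The procedure name StripTable, the sample string and the buffer size are made up for illustration. It assumes lowercase tags and does not skip scripts or styles or decode entities such as &amp;.

```asm
include \masm32\include\masm32rt.inc

StripTable PROTO :DWORD, :DWORD

.data
html db "<table><tr><td>A</td><td>B</td></tr></table>", 0   ; sample input
.data?
buf  db 64 dup(?)                ; output is never longer than the input

.code
start:
    invoke StripTable, offset html, offset buf
    invoke StdOut, offset buf    ; cells separated by tabs, one row per line
    invoke ExitProcess, 0

; pSrc -> zero-terminated HTML, pDst -> buffer at least as large as the input
StripTable proc uses esi edi pSrc:DWORD, pDst:DWORD
    mov esi, pSrc
    mov edi, pDst
  NextChar:
    lodsb                        ; al = [esi], esi = esi + 1
    test al, al
    jz Done
    cmp al, '<'
    je IsTag
    stosb                        ; plain text: [edi] = al, edi = edi + 1
    jmp NextChar
  IsTag:                         ; esi points just past the '<'
    cmp byte ptr [esi], '/'
    jne SkipTag                  ; only closing tags produce a separator
    cmp byte ptr [esi+1], 't'
    jne SkipTag
    mov al, [esi+2]
    test al, al
    jz SkipTag                   ; string ends here, do not read further
    cmp byte ptr [esi+3], '>'
    jne SkipTag                  ; rejects longer names such as </thead>
    cmp al, 'r'
    je PutCrLf
    cmp al, 'd'
    je PutTab
    cmp al, 'h'
    jne SkipTag
  PutTab:
    mov al, 9                    ; tab after each cell
    stosb
    jmp SkipTag
  PutCrLf:
    mov ax, 0A0Dh                ; stored low byte first: CR, then LF
    stosw
  SkipTag:
    lodsb                        ; discard everything up to the closing '>'
    test al, al
    jz Done
    cmp al, '>'
    jne SkipTag
    jmp NextChar
  Done:
    mov byte ptr [edi], 0        ; zero-terminate the output
    ret
StripTable endp
end start
```

Each row ends with a trailing tab before the line break because the tab is written after every cell, including the last one; a spreadsheet should simply see an empty last column. Writing the tab only between cells would need one extra flag.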
Title: Re: Extract tables from web pages
Post by: hutch-- on June 28, 2022, 09:45:50 AM
> So nobody is allowed to comment on any finished projects here? Interesting ...

No, it means using the subforum for what it is designed for: a showcase for finished projects that should not be subject to graffiti. The Workshop is where analysis and other comments are encouraged. I honestly get tired of having to clean up sloppy posts plastered all over the place. The reason you can find anything is that I clean it up on a regular basis.
Title: Re: Extract tables from web pages
Post by: NoCforMe on June 28, 2022, 11:38:02 AM
Quote from: jj2007 on June 28, 2022, 09:44:25 AM
You want to roll your own, so do it. Start with the NoTag$() stuff, it's only about 320 lines of assembly code.

OK, so where is the source? I'm not familiar with how your vast information repositories are arranged.
Title: Re: Extract tables from web pages
Post by: jj2007 on June 28, 2022, 06:59:32 PM
NoTag$() is in \Masm32\MasmBasic\MasmBasic.inc