Title: moving thread from pb forums to here because of their server problems
Post by: bobl on August 27, 2015, 11:12:39 PM

I just wanted to say that in addition to each token's x and y values I might also need the width and height values (along with the rotation figure to discard e.g. vertical text). That's because the token text snippets that make up the same line don't necessarily have the same y value (which would make grouping them trivial). I'm currently looking at the "base" value to see if I can use it as a proxy for y.

It might be worth holding fire re an asm-based extraction solution until I can verify that
a) those values are what I think they are and
b) I can come up with a fool-proof algorithm to reliably make up the lines.

I'll come back when I've got something and thank you for your patience.
Regards
Dean

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 28, 2015, 08:20:45 AM

From my little tests, the "base" value seems a perfect alternative to y, so it looks like I can get away with extracting...
1 the PAGE NUMBER, generated by incrementing a variable every time <PAGE is encountered, and...
2 for each token, i.e. the string between <TOKEN and </TOKEN>, the values for...
    ROTATION, i.e. don't record anything for the token if it's not zero; otherwise, if it is zero...
    X value
    BASE value
    TEXT value, i.e. the string between > and </TOKEN at the very end of the token

the same-ness of the base figure means I can avoid the fag-paper thin calculations I was once doing because phrases are either on the same line or they're not so....x and base can now probably be singles instead of doubles. The values seem to go to 4 decimal places.
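
The grouping described above can be sketched roughly as follows. This is only a minimal sketch in Python, not the asm or PowerBASIC solution being discussed: it assumes the tokens have already been pulled out of the XML into (page, rotation, x, base, text) tuples, and the function name and the sample values are made up for illustration.

```python
from collections import defaultdict

def group_tokens_into_lines(tokens, digits=4):
    """Group (page, rotation, x, base, text) tuples into text lines."""
    lines = defaultdict(list)
    for page, rotation, x, base, text in tokens:
        if rotation != 0:
            continue  # skip vertical or otherwise rotated text
        # Tokens on the same line share a base value; rounding to the
        # 4 decimal places seen in the XML guards against float noise.
        lines[(page, round(base, digits))].append((x, text))
    result = []
    for page, base in sorted(lines):
        # Within a line, order the snippets left to right by x.
        words = [text for x, text in sorted(lines[(page, base)])]
        result.append((page, base, " ".join(words)))
    return result

# Hypothetical sample tokens, not taken from the real file.
tokens = [
    (1, 0, 120.5, 700.1234, "operating"),
    (1, 0, 60.0, 700.1234, "Group"),
    (1, 90, 10.0, 300.0, "sideways"),
    (1, 0, 60.0, 715.5, "Revenue"),
]
for page, base, text in group_tokens_into_lines(tokens):
    print(page, base, text)
```

Sorting by base puts the lines in ascending base order; whether that is top-to-bottom or bottom-to-top on the page depends on the converter's coordinate system, which would need checking against a real page.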

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 28, 2015, 11:04:21 AM

Dean,

I got your PDF file and at least know what the source was. Is there any reason to convert the PDF to XML? The conversion looks both messy and lossy, which is generating some of the problems in parsing the XML file.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 28, 2015, 05:44:32 PM

Hutch
That's a very good question.

The only reason I converted from pdf to xml is that I read that you can't extract the data in a comprehensible form from pdfs without first decoding, which I don't have a clue about. If that's not true, and you can get the above fields directly, I'd much prefer that, especially considering that the conversion isn't exactly instantaneous and that the 3rd party software I am relying on is lossy.

Re loss...I did try a number of pdf-to-txt variants but these were even more lossy than the pdf-to-xml route, particularly in terms of structure.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 28, 2015, 06:17:55 PM

I don't think it would be viable to try to parse PDF directly, but I wondered if there is such a thing as a decent PDF to TEXT converter? My Nitro Reader won't do it, it locks up instead, and I don't keep the Adobe version any longer due to the endless upgrades.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: dedndave on August 28, 2015, 08:17:41 PM

for single-use, there are likely online converters

from what i remember, there is a library package for working with PDF's
if you can figure out how it plays, shouldn't be hard to write a little app :P

Title: Re: moving thread from pb forums to here because of their server problems
Post by: jj2007 on August 28, 2015, 09:15:13 PM

Quote from: dedndave on August 28, 2015, 08:17:41 PM
shouldn't be hard to write a little app :P

PDF format specs are public nowadays. (https://en.wikipedia.org/wiki/Portable_Document_Format#Adobe_specifications) Have fun...

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 28, 2015, 10:23:32 PM

It's worth pointing out that my comments re pdf-to-text offerings being lossy apply only to the free ones.
I looked at Debenu's Quick Pdf library (Dave, it might be the one you're thinking of), which gets good reviews from Powerbasic users...
http://www.powerbasic.com/support/pbforums/showthread.php?t=54013.
It's $499. If it works flawlessly and the cheaper ones below don't...it might be worth that.

I did find this today but am not sure if the conversion is pdf-to-text or the other way around
http://convert-pdf-software-review.toptenreviews.com.
Most of them look reasonably priced and therefore worth investigating.
Other than two they all seem to do batch conversion which for 30,000 files would be a must-have.
I use pdf xchange viewer and know they have a pro version so might talk to them too.

Adobe stuff always seems too big and cumbersome to me, and the constant updates Hutch mentioned would irritate me too.

Hutch
Thanks for the searching question which has prompted me to revisit the conversion tools.
Thank you.

Dave and jj2007
Thanks for your responses.
In truth...I'd love to get in there and write my own...but it would take me ages and I've got to finish this app asap. When it's running I'll have the time to "polish" it by doing work like this.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: jj2007 on August 28, 2015, 10:57:26 PM

Quote from: bobl on August 28, 2015, 10:23:32 PM
It's worth pointing out that my comments re pdf-to-text offerings being lossy only applies to the free ones.
I looked at Debenu's Quick Pdf library (Dave might be the one you're thinking of) which gets good reviews from Powerbasic users...
http://www.powerbasic.com/support/pbforums/showthread.php?t=54013.
It's $499. If it works flawlessly and the cheaper ones below don't...it might be worth that.

I once needed this, and used pdf2txt. It worked, but no miracles. The PDF format is very tricky.
Have a look at http://software.informer.com/search/pdf+text+converter
Judging by the number of downloads could be a starting point. For example, 170,563 downloads for
"All File to All File Converter 3000", which is shareware but costs 160$ ::)

If your files are similar in format, you might find a cheap or free solution that handles your specific type of file well. The problem is "complicated" documents created with the latest Adobe format.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 28, 2015, 11:24:45 PM

Thanks for the information
pdf2txt was what I used and it lost some of the structure.
I just came back to say I've pointed http://www.somepdf.com/some-pdf-to-txt-converter.html at my parent pdf directory, the sub-directories of which hold the actual pdfs. It crashed a little while after reporting that one of my pdfs could not be converted, but before that it seemed to convert a number of the pdfs quite well and quite quickly, i.e. in batch mode.
Shame about the crash.

>Problem are "complicated" documents created with the latest Adobe format.
I'm guessing a lot of finance directors will want to have the latest Adobe tools.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 29, 2015, 12:21:54 AM

"The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course)."

"It worse than this - text need not be laid out on the page in reading order. It need not be laid out rectilinearly.
Pasted from <http://stackoverflow.com/questions/1136990/how-can-i-extract-text-from-a-pdf-file-in-perl> "

That explains a lot!

A glimmer of hope...re pdf-to-text
"PDF2TXT.py This is what I use, although it is Python, it works flawlessly.
http://www.unixuser.org/~euske/python/pdfminer/index.html"
"Yes, pdf2txt.py runs flawlessly ! – mandy Jul 9 '11 at 11:16"
Pasted from <http://stackoverflow.com/questions/1136990/how-can-i-extract-text-from-a-pdf-file-in-perl>
I installed and ran pdf2txt.py and got
===========================================================
51

CONSOLIDATED INCOME STATEMENT
FOR THE YEAR ENDED 31ST DECEMBER

Revenue
Net operating costs
Group operating proï¬t
===========================================================
i.e. the same corrupt representation of "profit" that pdf2xml is giving so...not flawless...not without possibly some flag setting. By contrast "Some pdf" converts "profit" fine. Shame it keels over when encountering problematic or protected pdfs.
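
One likely explanation for "proï¬t" is that the converter wrote the single-glyph "fi" ligature (U+FB01) as UTF-8 and the text was then displayed as Latin-1: the UTF-8 bytes EF AC 81 show up as "ï", "¬" and an invisible control character. Under that assumption (it is only an assumption about this particular file), the damage can be undone by re-decoding rather than by a list of replacements. A minimal Python sketch, with a made-up function name:

```python
import unicodedata

def repair_mojibake(text):
    """Undo 'UTF-8 bytes read as Latin-1', then expand ligatures."""
    try:
        fixed = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        fixed = text  # not this kind of damage; leave the text alone
    # NFKC turns the ligature U+FB01 into the two letters "fi".
    return unicodedata.normalize("NFKC", fixed)

# "ï", "¬" and the invisible byte 0x81, written as escapes so they survive pasting.
garbled = "Group operating pro\u00ef\u00ac\u0081t"
print(repair_mojibake(garbled))
```

Two caveats: NFKC also rewrites other compatibility characters (superscripts, non-breaking spaces and so on), and if the text went through Windows-1252 rather than Latin-1 the encode step would need "cp1252" instead and may fail on bytes that code page leaves undefined.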

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 29, 2015, 05:27:03 AM

I've been looking at various tools today and have just downloaded and had a go with ghostscript.
Playing with some of the flags and ending up with...

gswin32c -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dCOMPLEX -c save -f ps2ascii.ps c:\2013.pdf -c quit >c:\2013.txt

...it's just produced the attached file which certainly looks very "uniform".
I'm just in the process of assembling it into lines to see how faithfully it reflects the original.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: jj2007 on August 29, 2015, 06:03:06 AM

Looks horrible :biggrin:

What is it? A database? Without knowing the original pdf, it's difficult to understand. What do you want to do with it, extract the data, or translate to a spreadsheet??

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 29, 2015, 11:18:32 AM

Dean,

That looks like it can be parsed OK. Just grab the character(s) inside the brackets, the "(\243)" looks like the line breaks.

LATER:

This is the first try and the ghostscript format looks like it is consistent to parse.

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    #include "\basic\include\win32api.inc"

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FUNCTION PBmain as LONG

    LOCAL lft as DWORD
    LOCAL rgt as DWORD

    editor$ = "\basic\qeditor.exe"

    Open "ghostscript.txt" for Input as #1
    Open "output.txt" for Output as #2

  ' ***************
  ' line processing
  ' ***************
    Do
      Line Input #1, a$

      If instr(a$,"\243") <> 0 Then             ' line break character
        Print #2, chr$(13,10);
        ! jmp outlbl
      End if

      Select Case as CONST$ left$(a$,1)
        Case "S"                                ' normal text character
          lft = instr(a$,"(")+1
          rgt = instr(a$,")")
          Print #2, mid$(a$,lft,rgt-lft);

        Case "P"
          Print #2, "--------"+chr$(13,10);     ' show page breaks

        Case "F"
          Print #2, chr$(13,10)+a$              ' font line

        Case "C"
          Print #2, chr$(13,10)+a$

        Case "R"
          Print #2, chr$(13,10)+a$

        Case Else
          Print #2, a$                          ' default

      End Select

    outlbl:

    Loop while not eof(1)

    Close #2
    Close #1

  ' *****************
  ' global processing
  ' *****************
    src$ = load_file("output.txt")

    src$ = block_ltrim(src$)
    Replace chr$(13,10,13,10) with chr$(13,10) in src$
    save_file("output.txt",src$)

  ' *****************

    a& = shell(editor$+" output.txt",1)

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    DECLARE FUNCTION open_file_A LIB "KERNEL32.DLL" ALIAS "CreateFileA" ( _
                     ByVal lpFileName AS DWORD, _
                     ByVal dwDesiredAccess AS DWORD, _
                     ByVal dwShareMode AS DWORD, _
                     ByVal lpSecurityAttributes AS DWORD, _
                     ByVal dwCreationDisposition AS DWORD, _
                     ByVal dwFlagsAndAttributes AS DWORD, _
                     BYVAL hTemplateFile AS DWORD) AS DWORD

    DECLARE FUNCTION get_file_size LIB "KERNEL32.DLL" ALIAS "GetFileSize" ( _
                     BYVAL hFile AS DWORD, lpFileSizeHigh AS LONG) AS LONG

    DECLARE FUNCTION file__read LIB "KERNEL32.DLL" ALIAS "ReadFile" ( _
                     BYVAL hFile AS DWORD,ByVal pbuff as DWORD, BYVAL nNumberOfBytesToRead AS DWORD, _
                     ByVal lpNumberOfBytesRead AS DWORD, ByVal lpOverlapped AS DWORD) AS LONG

    DECLARE FUNCTION closefh LIB "KERNEL32.DLL" ALIAS "CloseHandle" ( _
                     BYVAL hObject AS DWORD) AS LONG

' ------------------------------------

FUNCTION load_file(fname$) as STRING

    LOCAL hFile as DWORD
    LOCAL flen  as DWORD
    LOCAL pdat  as DWORD         ' string pointer
    LOCAL bred  as DWORD         ' bytes read variable

    hFile = open_file_A(StrPtr(fname$),&H80000000& or &H40000000&,0,0,3,&H00000080,0)
    flen  = get_file_size(hFile,0)

    buffer$ = nul$(flen)
    pdat = StrPtr(buffer$)

    file__read(hFile,pdat,flen,VarPtr(bred),0)

    closefh hFile

    FUNCTION = buffer$

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    DECLARE FUNCTION fcreate_A LIB "KERNEL32.DLL" ALIAS "CreateFileA" ( _
                     ByVal lpFileName AS DWORD, _
                     ByVal dwDesiredAccess AS DWORD, _
                     ByVal dwShareMode AS DWORD, _
                     ByVal lpSecurityAttributes AS DWORD, _
                     ByVal dwCreationDisposition AS DWORD, _
                     ByVal dwFlagsAndAttributes AS DWORD, _
                     BYVAL hTemplateFile AS DWORD) AS DWORD

    DECLARE FUNCTION file__write LIB "KERNEL32.DLL" ALIAS "WriteFile" ( _
                     BYVAL hFile AS DWORD,ByVal lpBuffer AS DWORD, _
                     BYVAL nNumberOfBytesToWrite AS DWORD, _
                     ByVal NumberOfBytesWritten AS DWORD,ByVal lpOverlapped AS DWORD) AS DWORD

    DECLARE FUNCTION fh_close LIB "KERNEL32.DLL" ALIAS "CloseHandle" ( _
                     BYVAL hObject AS DWORD) AS LONG

' ------------------------------------------

FUNCTION save_file(fname$,src$) as DWORD

    LOCAL hFile as DWORD
    LOCAL pdat as DWORD         ' string pointer
    LOCAl ldat as DWORD         ' data length
    LOCAL bwrt as DWORD         ' bytes written variable

    hFile = fcreate_A(StrPtr(fname$),&H40000000&,0,0,2,&H00000080,0)

    pdat = StrPtr(src$)         ' get string address
    ! mov eax, pdat
    ! mov eax, [eax-4]          ' get string length
    ! mov ldat, eax

    file__write(hFile,pdat,ldat,VarPtr(bwrt),0)

    fh_close hFile

    FUNCTION = bwrt

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FUNCTION block_ltrim(txt$) as STRING

  ' ----------------------------------------------
  ' trim leading tabs and spaces on multiple lines
  ' of source and overwrite it with the results.
  ' ----------------------------------------------

    #REGISTER NONE

    LOCAL src as DWORD

    src = StrPtr(txt$)

    ! mov edx, src              ' source in EDX
    ! mov ecx, src              ' same address as target
    ! sub edx, 1

  #align 4
  trimleft:
    ! add edx, 1
    ! cmp BYTE PTR [edx], 32    ' loop back on space
    ! je trimleft
    ! cmp BYTE PTR [edx], 9     ' loop back on tab
    ! je trimleft
    ! sub edx, 1

  #align 4
  store:
    ! add edx, 1
    ! movzx eax, BYTE PTR [edx] ' copy byte
    ! mov [ecx], al
    ! add ecx, 1
    ! test al, al               ' test for written terminator
    ! jz bl_out                 ' exit on terminator
    ! sub al, 10                ' test for ascii 10
    ! jnz store
    ! jmp trimleft

  bl_out:
    ! sub ecx, src              ' sub src from ecx
    ! mov src, ecx              ' write ecx back to src (src reuse)
    FUNCTION = left$(txt$,src)  ' return basic string

END FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 29, 2015, 06:47:00 PM

I eventually got Nitro Reader to convert the original PDF to text, and it is a lot cleaner than the conversions we have tried so far. What makes it messy is that it contains control codes that I have yet to find a set of conversions for. You can guess at some of them and clean up the results a bit; they look like old octal notation but don't all work the same way. I could directly replace the UK pound symbol, but at least one of the control codes is 3 characters that convert to the characters "fi", so I don't yet know how to properly convert the data.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 29, 2015, 07:09:58 PM

jj
the original is a pdf but it's too big to post here so I've attached a screen shot of a page
Hutch
Thanks for the "(\243)" line break advice.
Last night it took me a few goes to break the file....with $LF.
I'm just seeing \243 now.
Re those control codes...There's a lot more pitfalls in doing this than I had expected.
Thank you very much indeed for your work.
I'm still parsing at the moment.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: jj2007 on August 29, 2015, 08:00:12 PM

If the screenshot corresponded to a text file, we would have a chance of understanding it. Could you post the text file that was converted from this pdf?

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 29, 2015, 08:37:26 PM

Dean,

What I am interested in is what do you need to get out of the PDF file ? There is a lot of information in the PDF file, image data, tables and text data and a lot of formatting as well.

From the original PDF file converted to text with Nitro Reader I can get this type of data with some replacements and further massaging.

52
Group plc
Annual Report and Accounts 2013
CONSOLIDATED STATEMENT OF COMPREHENSIVE INCOME
FOR THE YEAR ENDED 31ST DECEMBER
Notes
2013
£m
2012
Restated
(note 6)
£m
Profit for the year 0.7 1.5
Other comprehensive income/(losses) to be reclassified to profit or loss in subsequent periods:
Exchange adjustments on hedge of net investments 19 0.6
Exchange differences on translation of foreign operations 0.4 (2.8) Tax on items taken to reserves (0.4)
Net other comprehensive income/(losses) to be reclassified to profit or loss in
subsequent periods (2.2)
Items not to be reclassified to profit or loss in subsequent periods:
Actuarial gains/(losses) on defined benefit pension schemes 6 2.3 (32.7) Tax on items taken to reserves 8 (2.3) 7.4
Net other comprehensive losses not to be reclassified to profit or loss in subsequent
periods (25.3)
Other comprehensive losses for the year (27.5)
Total comprehensive income/(losses) for the year 0.7 (26.0)
Attributable to:
Equity holders of the parent 0.8 (25.9) Non-controlling interests (0.1) (0.1)
Total comprehensive income/(losses) for the year 0.7 (26.0)
Notes to the accounts are on pages 57 to 82.

PS: The notation \243 is the UK pound symbol.
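
For what it's worth, \243 is octal for 163, which is the pound sign in Latin-1, and in Adobe StandardEncoding octal 256 and 257 (decimal 174 and 175) are the "fi" and "fl" ligatures, which would explain a 3-character code turning into "fi". A minimal Python sketch of decoding such escapes is below; it assumes every escape is a backslash followed by exactly three octal digits and that any code not in the small override table can be read as Latin-1, which will not hold for every code above 127. The function and table names are invented for the example.

```python
import re

# Adobe StandardEncoding puts the ligatures at codes 174 and 175,
# where Latin-1 has the registered sign and the macron.
STANDARD_ENCODING_OVERRIDES = {174: "fi", 175: "fl"}

def decode_octal_escapes(text, overrides=STANDARD_ENCODING_OVERRIDES):
    """Replace \\ddd octal escapes with the characters they stand for."""
    def replace(match):
        code = int(match.group(1), 8)
        # Fall back to the Latin-1 character when there is no override.
        return overrides.get(code, chr(code))
    return re.sub(r"\\([0-7]{3})", replace, text)

print(decode_octal_escapes(r"(\243m) Group operating pro\256t"))
```

A fuller version would carry the whole StandardEncoding table instead of two overrides.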

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 29, 2015, 10:06:57 PM

I just came back to ask if you've discovered what the page delimiter is, i.e. I've got some output but characters are missing. I foolishly tried the "P" on its own line.

JJ
I was intending to post the converted page to go with that pdf page and that's what I'm trying to find by parsing the file.

Hutch
Yes, that's a very good question.

The short answer is: the lines in jj's screen shot above your post, for the income statement, balance sheet and cashflow statement...and preferably the consolidated versions of these statements, if present.
Ideally the lines start with a textual title and end in two numbers: this year's and last year's.
The statements are in very close proximity, i.e. often a page each and consecutive.

Here's an example
Revenue 1111.11 2222.22
or
Revenue 3 1111.11 2222.22

and I'd like to create the following table.
revenue | 1111.11 | 2222.22
cogs | 33.11 | 44.22

Unfortunately, it's not uncommon for the items to digress from this ideal format (see the balance sheet) and automatically correcting them will take another level of parsing.

At this stage getting the right pages (or all of them) with the lines intact would be great.

Hope that explains and thank you for your help...if not please let me know.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 29, 2015, 11:20:07 PM

I read in some university's pdf to text tutorial that the ps2ascii utility that comes with ghostscript is not very robust. By contrast, ghostscript itself was deemed "wonderful". To eliminate the former, I tried this....

gswin32c -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=54 -dLastPage=57 -sOutputFile=output.txt -q 2013.pdf -c quit

and got the attached file of the "statements of interest".
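
Since there are some 30,000 files to get through and one of the other tools fell over on a single bad pdf, the same command could be driven from a small script that carries on after a failure. The following is only a sketch in Python: the switches are copied from the command line above, while the function names, the default executable name and the "one .txt next to each .pdf" layout are assumptions for illustration.

```python
import subprocess
from pathlib import Path

def build_gs_command(pdf_path, out_path, first_page=None, last_page=None,
                     gs_exe="gswin32c"):
    """Build the txtwrite command line used above as an argument list."""
    cmd = [gs_exe, "-dNOPAUSE", "-sDEVICE=txtwrite"]
    if first_page is not None:
        cmd.append(f"-dFirstPage={first_page}")
    if last_page is not None:
        cmd.append(f"-dLastPage={last_page}")
    cmd += [f"-sOutputFile={out_path}", "-q", str(pdf_path), "-c", "quit"]
    return cmd

def convert_tree(root, gs_exe="gswin32c"):
    """Convert every pdf under root; return the ones Ghostscript rejected."""
    failures = []
    for pdf in sorted(Path(root).rglob("*.pdf")):
        out = pdf.with_suffix(".txt")
        result = subprocess.run(build_gs_command(pdf, out, gs_exe=gs_exe))
        if result.returncode != 0:
            failures.append(pdf)  # note the failure and keep going
    return failures

print(build_gs_command("2013.pdf", "output.txt", 54, 57))
```

Calling convert_tree on the parent pdf directory would then return the list of problem files instead of stopping at the first one; it does need Ghostscript on the path to run.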

I also looked into the "corrupted" "fi" in "profit" and found the following explanation...

"
fi and fl are character codes 174 and 175 in Adobe StandardEncoding. Character codes 174 and 175 are "registered" and "macron" in ISO Latin 1 and encodings derived therefrom (such as Windows ANSI, LY1 etc.). So somehow you have old TFMs lying around set up for Adobe StandardEncoding.
Pasted from <http://www.verycomputer.com/18_7c47d859451ce7d4_1.htm>
"

Title: Re: moving thread from pb forums to here because of their server problems
Post by: jj2007 on August 29, 2015, 11:35:18 PM

What is this?
Current tax assets â€" 1.1 1.0

What is the difference - they both seem "fi":
ProÂ"t before tax
..
Net ï¬nance costs

I've given it a first shot (attached). It looks almost convincing in Excel. One problem is that sometimes there is a note, sometimes not (e.g. Profit before tax):

Revenue	3	250.4	244.6
Net operating costs	4	(242.2)	(238.1)
Group operating profit	4	8.2	6.5
Pension charge	6	(3.5)	(2.9)
Non-recurring costs	4	(2.2)	(1.7)
Profit before finance income/(costs) and tax	2.5	1.9
Finance income	7	0.1	0.4
Finance costs	7	(1.5)	(0.6)
Profit before tax	1.1	1.7
Tax expense	8	(0.4)	(0.2)
Profit for the year	0.7	1.5
Profit attributable to:
Equity holders of the parent	0.8	1.6
Non-controlling interests	(0.1)	(0.1)
Profit for the year	0.7	1.5

Source:

include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let esi=CL$()			; you may pass a textfile in the command line
  .if Len(esi)==1
	Let esi="Output.txt"
  .endif
  Let esi=FileRead$(esi)
  Let esi=Replace$(esi, "Â"", "fi")
  Let esi=Replace$(esi, "ï¬", "fi")
  Let esi=Replace$(esi, "ï¬,", "fl")
  Let esi=Replace$(esi, "Â£", "Ps")	; pound sterling
  Let esi=Replace$(esi, "â€"", "xxx€")	; Euros?
  mov ecx, Len(esi)
  Let edi=New$(ecx)
  push esi
  push edi
  add ecx, esi		; startpos+Len(esi), marks end of parsing on stack
  push ecx
  .Repeat
	lodsb
	.if al==32
		.if dword ptr [esi]==20202020h
			.Repeat
				lodsb
			.Until al!=32
			dec esi
			mov al, 9	; one tab
		.endif
	.elseif al==13	; CrLf
		stosb
		movsb
		.Repeat
			lodsb
		.Until al>32
	.endif
	stosb
  .Until esi>=stack
  pop ecx
  pop edi
  pop esi
  FileWrite "Output.tab", edi
  sub ecx, esi
  Print Str$("%i bytes processed", ecx)
  Exit
EndOfCode

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 30, 2015, 12:20:30 AM

JJ
I just came back to post this file cleaner....

#COMPILE EXE
'#DIM ALL

#INCLUDE "win32api.inc"

FUNCTION clean_ln(ln$) AS STRING
   new_ln$ = ln$
   REPLACE "Â"" WITH "fi" IN new_ln$
   REPLACE "ï¬" WITH "fi" IN new_ln$
   REPLACE "Â£" WITH "£" IN new_ln$
   REPLACE "â€"" WITH "-" IN new_ln$
   REPLACE "ï¬," WITH "fl" IN new_ln$
   FUNCTION = new_ln$
END FUNCTION


FUNCTION PBMAIN () AS LONG
   editor$="c:\qedit35\qeditor"
    OPEN "output.txt" FOR INPUT AS #1
    OPEN "output2.txt" FOR OUTPUT AS #2
      WHILE ISFALSE(EOF(#1))
         LINE INPUT #1,ln$
         ln$=clean_ln(ln$)
         PRINT #2,ln$
      WEND
    CLOSE #2
    CLOSE #1
   a& = SHELL(editor$+" output2.txt",1)
END FUNCTION

....am reading your post now. Thank you for it.

Later...
Thanks for the masm code...I need that!

>sometimes there is a note
Well spotted...and yes, it's optional. The way to deal with that and the title is to count inwards from both sides,
i.e.
for the title....
start at the left and move rightwards until you hit a number, then keep going to see whether a letter follows it. If none does, the character before that first number is the end of the title.
for the numbers...
From the right, move leftwards two non-space groupings; those are your "this year's" and "last year's" figures. The problem is that authors add more than 2 number columns a lot more often than I'd like, which is tricky to deal with. It will probably involve identifying the column title fields and counting them in from the right until I see just the current year on its own, i.e. 2015, and then counting that many non-space groups in on each line. If I can't get the title column, or it isn't plain "2015", then I'll have to suck the figures off manually. I do have a very good tool for this called Monarch Pro, but it doesn't like this 2013 pdf, whereas everything else I have opens it fine. Probably 'cos it's a few years old now and, as you pointed out, the specs change.
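
The "count inwards from both sides" idea might look something like this in Python. It is a rough sketch of the simple two-column case only, and it makes an assumption that is not in the post: a note reference is a bare 1 or 2 digit integer sitting just before the two figures. The names are invented for the example.

```python
import re

# A figure such as 250.4, (242.2) or 1,234.5
NUMBER = re.compile(r"^\(?-?\d[\d,]*(?:\.\d+)?\)?$")
# Assumption: a note reference is a bare 1 or 2 digit integer.
NOTE = re.compile(r"^\d{1,2}$")

def parse_statement_line(line):
    """Return (title, note, this_year, last_year), or None if no match."""
    fields = line.split()
    if len(fields) < 3:
        return None
    # From the right: the last two groups must both look like figures.
    if not (NUMBER.match(fields[-1]) and NUMBER.match(fields[-2])):
        return None
    this_year, last_year = fields[-2], fields[-1]
    rest = fields[:-2]
    note = None
    # An optional note number may sit between the title and the figures.
    if len(rest) > 1 and NOTE.match(rest[-1]):
        note = rest.pop()
    return " ".join(rest), note, this_year, last_year

for sample in ("Revenue 3 1111.11 2222.22", "Profit before tax 1.1 1.7"):
    title, note, this_year, last_year = parse_statement_line(sample)
    print(" | ".join([title.lower(), this_year, last_year]))
```

Lines where one of the two columns is empty, or where there are more than two columns of figures, will be misread by this, so they would still need the column-title approach described above or a manual check.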

Thanks for your program. I ran it and lotus opened it fine (when I changed the extension to .txt)

Title: Re: moving thread from pb forums to here because of their server problems
Post by: jj2007 on August 30, 2015, 03:19:41 AM

Quote from: bobl on August 30, 2015, 12:20:30 AM
Thanks for your program. I ran it and lotus opened it fine (when I changed the extension to .txt)

My pleasure. I attach a new version that handles the columns correctly and opens output.tab in Excel if Excel is running.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 30, 2015, 03:31:01 AM

REPLACE "Â"" WITH "fi" IN new_ln$
REPLACE "ï¬" WITH "fi" IN new_ln$
REPLACE "Â£" WITH "£" IN new_ln$
REPLACE "â€"" WITH "-" IN new_ln$
REPLACE "ï¬," WITH "fl" IN new_ln$

This is the type of stuff we need to clean up the converted file. If we can get a full list of these as standard control codes, parsing the XML file would be a lot easier.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: jj2007 on August 30, 2015, 03:51:19 AM

What is the "â€""?
New version attached above, before Hutch's post.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 30, 2015, 04:37:13 AM

I think the "â€"" is the same as a bullet mark in Word. A "-" seems to do the job and it reads OK.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 30, 2015, 07:56:50 PM

Thanks JJ and Hutch
I don't know if this is relevant
// Adobe Standard Encoding table for ttf2pt1
// Thomas Henlich <Thomas.Henlich@mailbox.tu-dresden.de>
Pasted from <http://get-software.net/fonts/utilities/ttf2pt1/maps/adobe-standard-encoding.map>

I've attached the "whole" file. Looking through it, the "ff" in "efficiency" is missing, but in other places there are funny character sequences for it. In the attached program I just included the whole word for conversion.

One thing about the output is pages: there aren't any :) . I managed to copy the textual output to one page per file (see %%d in the output file name; the percent is doubled so as not to conflict with DOS' %), which I accidentally left in the attached batch file.

I'm still working on how to write page separators to the single output file.
The file size is now about 500k and not the 12-13mb that Hutch originally questioned,
i.e. causing this switch, which given the 24X reduction was a very good call.
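
One way round the missing page separators, given that one-file-per-page output already works, is to stitch the page files back together afterwards. The sketch below is Python and purely illustrative: the "page_N.txt" naming, the separator text and the function names are assumptions, not what the attached batch file produces.

```python
import re
import tempfile
from pathlib import Path

def page_number(path):
    """Take the last run of digits in the file name as the page number."""
    digits = re.findall(r"\d+", path.stem)
    return int(digits[-1]) if digits else 0

def join_pages(page_dir, pattern="page_*.txt",
               separator="-------- page {n} --------"):
    """Concatenate per-page text files with a separator line before each."""
    parts = []
    # Sort numerically so that page 10 comes after page 9, not after page 1.
    for path in sorted(Path(page_dir).glob(pattern), key=page_number):
        parts.append(separator.format(n=page_number(path)))
        parts.append(path.read_text(encoding="utf-8", errors="replace").rstrip())
    return "\n".join(parts) + "\n"

# Small self-contained demonstration using throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    for n, body in [(10, "Balance sheet"), (9, "Income statement")]:
        Path(tmp, f"page_{n}.txt").write_text(body, encoding="utf-8")
    combined = join_pages(tmp)
print(combined)
```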

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 30, 2015, 09:09:25 PM

That's interesting....
http://ghostscript.com/doc/7.07/Use.htm#PDF_switches
Ghostscript always expects the first line of a pdf to start with %PDF, e.g. %PDF-1.2.
Looking at 2013.pdf, it's %PDF-1.5, so that narrows down a bit what we're looking at re code conversions.
The advice also says that if there's a problem converting, you can ditch any rubbish before %PDF and it might then work, i.e. Ghostscript expects this as a starting point whereas e.g. Adobe's viewer is more liberal.
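
Ditching the rubbish before %PDF is easy to script. A minimal Python sketch, with made-up function names, that works on the raw bytes of a file and also reports the version from the header line:

```python
def strip_leading_junk(data: bytes) -> bytes:
    """Drop anything that precedes the %PDF- header."""
    start = data.find(b"%PDF-")
    if start == -1:
        raise ValueError("no %PDF- header found")
    return data[start:]

def pdf_version(data: bytes) -> str:
    """Return the version text from the header line, e.g. '1.5'."""
    header = strip_leading_junk(data)[:16]
    # The header line may end in CR, LF or CRLF.
    first_line = header.split(b"\r")[0].split(b"\n")[0]
    return first_line[len(b"%PDF-"):].decode("ascii", errors="replace")

# Made-up sample bytes standing in for the start of a damaged file.
sample = b"mail header junk\r\n%PDF-1.5\n%rest of file"
print(pdf_version(sample))
print(strip_leading_junk(sample)[:8])
```

In practice the bytes would come from reading the pdf in binary mode, and the cleaned bytes would be written to a new file before handing it to Ghostscript.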

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 30, 2015, 09:28:02 PM

Hi Dean,

This is a much better result. I built your cleaner and ran it on the test file and it worked fine. Now the question is: is the resulting output suitable for parsing out the data you need?

LATER:

RE: the page numbers, does this properly identify the pages in the file? I have used your set of character replacements, then parsed each line that contains "Annual Report and Accounts", moving the page number to the left when it's aligned right.

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    #include "\basic\include\win32api.inc"

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FUNCTION PBmain as LONG

    editor$ = "\basic\qeditor.exe"

  ' -----------------
  ' GLOBAL operations
  ' -----------------
    src$ = load_file("2013.txt")
    Replace "Â"" with "fi" in src$
    Replace "ï¬" with "fi" in src$
    Replace "Â£" with "£" in src$
    Replace "â€"" with "-" in src$
    Replace "ï¬," with "fl" in src$
    Replace "â€™" with "'" in src$
    Replace "â€˜" with "'" in src$
    Replace "" with "ff" in src$
    Replace "eicency" with "efficiency" in src$
    save_file("temp.txt",src$)

  ' ----------------------
  ' single line operations
  ' ----------------------
    Open "temp.txt" for Input as #1
    Open "cleaned.txt" for Output as #2

    Do
      Line Input #1, a$

      If instr(a$,"Annual Report and Accounts") <> 0 Then
        a$ = remove$(a$,"2013")
        a$ = monospace$(a$)

      ' -------------------------------------
      ' get page number from right sided page
      ' -------------------------------------
        If left$(a$,1) = "A" Then
          numb$ = ltrim$(right$(a$,2))
          a$ = "Page " + numb$ + " Annual Report and Accounts"
          Print #2, a$
          ! jmp bypass
        End If

      ' -------------------------------------
      ' output the left side page number line
      ' -------------------------------------
        a$ = "Page " + a$
        Print #2, a$
        ! jmp bypass

      End If

    ' ---------------------
    ' print any other lines
    ' ---------------------
      Print #2, a$

    bypass:

    Loop while not eof(1)

    Close #2
    Close #1

    x& = shell(editor$+" cleaned.txt",1)

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    DECLARE FUNCTION open_file_A LIB "KERNEL32.DLL" ALIAS "CreateFileA" ( _
                     ByVal lpFileName AS DWORD, _
                     ByVal dwDesiredAccess AS DWORD, _
                     ByVal dwShareMode AS DWORD, _
                     ByVal lpSecurityAttributes AS DWORD, _
                     ByVal dwCreationDisposition AS DWORD, _
                     ByVal dwFlagsAndAttributes AS DWORD, _
                     BYVAL hTemplateFile AS DWORD) AS DWORD

    DECLARE FUNCTION get_file_size LIB "KERNEL32.DLL" ALIAS "GetFileSize" ( _
                     BYVAL hFile AS DWORD, lpFileSizeHigh AS LONG) AS LONG

    DECLARE FUNCTION file__read LIB "KERNEL32.DLL" ALIAS "ReadFile" ( _
                     BYVAL hFile AS DWORD,ByVal pbuff as DWORD, BYVAL nNumberOfBytesToRead AS DWORD, _
                     ByVal lpNumberOfBytesRead AS DWORD, ByVal lpOverlapped AS DWORD) AS LONG

    DECLARE FUNCTION closefh LIB "KERNEL32.DLL" ALIAS "CloseHandle" ( _
                     BYVAL hObject AS DWORD) AS LONG

' ------------------------------------

FUNCTION load_file(fname$) as STRING

    LOCAL hFile as DWORD
    LOCAL flen  as DWORD
    LOCAL pdat  as DWORD         ' string pointer
    LOCAL bred  as DWORD         ' bytes read variable

    hFile = open_file_A(StrPtr(fname$),&H80000000& or &H40000000&,0,0,3,&H00000080,0)
    flen  = get_file_size(hFile,0)

    buffer$ = nul$(flen)
    pdat = StrPtr(buffer$)

    file__read(hFile,pdat,flen,VarPtr(bred),0)

    closefh hFile

    FUNCTION = buffer$

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    DECLARE FUNCTION fcreate_A LIB "KERNEL32.DLL" ALIAS "CreateFileA" ( _
                     ByVal lpFileName AS DWORD, _
                     ByVal dwDesiredAccess AS DWORD, _
                     ByVal dwShareMode AS DWORD, _
                     ByVal lpSecurityAttributes AS DWORD, _
                     ByVal dwCreationDisposition AS DWORD, _
                     ByVal dwFlagsAndAttributes AS DWORD, _
                     BYVAL hTemplateFile AS DWORD) AS DWORD

    DECLARE FUNCTION file__write LIB "KERNEL32.DLL" ALIAS "WriteFile" ( _
                     BYVAL hFile AS DWORD,ByVal lpBuffer AS DWORD, _
                     BYVAL nNumberOfBytesToWrite AS DWORD, _
                     ByVal NumberOfBytesWritten AS DWORD,ByVal lpOverlapped AS DWORD) AS DWORD

    DECLARE FUNCTION fh_close LIB "KERNEL32.DLL" ALIAS "CloseHandle" ( _
                     BYVAL hObject AS DWORD) AS LONG

' ------------------------------------------

FUNCTION save_file(fname$,src$) as DWORD

    LOCAL hFile as DWORD
    LOCAL pdat as DWORD         ' string pointer
    LOCAL ldat as DWORD         ' data length
    LOCAL bwrt as DWORD         ' bytes written variable

    hFile = fcreate_A(StrPtr(fname$),&H40000000&,0,0,2,&H00000080,0)

    pdat = StrPtr(src$)         ' get string address
    ! mov eax, pdat
    ! mov eax, [eax-4]          ' get string length
    ! mov ldat, eax

    file__write(hFile,pdat,ldat,VarPtr(bwrt),0)

    fh_close hFile

    FUNCTION = bwrt

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FUNCTION monospace$(src$)

  ' ---------------------------------------------------------------------------
  ' left and right trim string, replace tabs with spaces and set single spacing
  ' ---------------------------------------------------------------------------
    #REGISTER NONE

    LOCAL pst as DWORD
    LOCAL dst$

    dst$ = src$                         ' work on copy

    pst = StrPtr(dst$)

    ! mov esi, pst
    ! sub esi, 1
    ! mov edi, pst

  trm:                                  ' trim leading tabs and spaces
    ! add esi, 1
    ! movzx eax, BYTE PTR [esi]
    ! cmp eax, 32
    ! je trm
    ! cmp eax, 9
    ! je trm

    ! sub esi, 1
    ! or ebx, -1                        ' set EBX non zero so it falls through the 1st TEST

  ' =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  #align 4
  pre:
    ! test ebx, ebx                     ' test for zero AFTER it's written.
    ! jz pastit

  stlp:
    ! add esi, 1
    ! movzx ebx, BYTE PTR [esi]
    ! cmp ebx, 9
    ! jne nxt1
    ! mov ebx, 32                       ' replace tabs with spaces

  nxt1:
    ! cmp ebx, 32
    ! jne nxt2
    ! movzx eax, BYTE PTR [esi+1]       ' test for following tab or space
    ! cmp eax, 32
    ! je pre
    ! cmp eax, 9
    ! je pre

  nxt2:
    ! mov [edi], bl                     ' write acceptable character
    ! add edi, 1
    ! test ebx, ebx                     ' test for zero AFTER it's written.
    ! jnz stlp

  ' =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  pastit:
    ! sub edi, 1
    ! cmp BYTE PTR [edi-1], 32          ' if last character is a space
    ! jne nxt3
    ! sub edi, 1

  nxt3:
    ! sub edi, pst                      ' length in EDI
    ! mov pst, edi

    FUNCTION = left$(dst$,pst)

END FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤
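
For anyone who would rather not read the inline assembler, here is a rough plain-BASIC sketch of what monospace$ is meant to do, going by its header comment (trim both ends, turn tabs into spaces, collapse runs of spaces). The name monospace_plain is made up for illustration, it has not been checked against the assembler version for every edge case, and it will be slower on large strings.

```powerbasic
' Hypothetical plain-BASIC counterpart to monospace$ (illustrative only).
FUNCTION monospace_plain(BYVAL src AS STRING) AS STRING

    REPLACE CHR$(9) WITH " " IN src          ' tabs become spaces

    DO WHILE INSTR(src, "  ") <> 0           ' repeat until no double spaces remain
        REPLACE "  " WITH " " IN src
    LOOP

    FUNCTION = TRIM$(src)                    ' strip leading and trailing spaces

END FUNCTION
```

The loop is needed because a single REPLACE pass does not rescan the text it has just written, so a run of four spaces would otherwise only shrink to two.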

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 30, 2015, 11:12:50 PM

Hutch
Yes, the data is all perfectly accessible in this file now, and whilst not all reports have "Annual Report and Accounts", most seem to have "Annual Report". Sometimes it's in a footer, but because I don't need the first page it doesn't matter. I have seen "Annual Report" printed vertically on some reports, so I'm not sure where that splits the page. I'm expecting to have to "eye-ball" some files for other reasons anyway, e.g. too many numeric columns, albeit with automated assistance, so that's not a problem.

All in all this seems a much better solution than the XML one I was pursuing, so thank you very much for steering me towards it.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: jj2007 on August 30, 2015, 11:18:34 PM

New version attached, with over a dozen Replace() statements. Open MS Excel and drag *.txt over the exe

Occasionally, you may see #NAME? in Excel. This is because Excel treats a hyphen or plus sign at the beginning of a tab-delimited field as the start of a formula.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 31, 2015, 12:24:04 AM

I gather that in the longer term you want to batch process a large number of files which may have at least slightly different notation, so I wonder if it's worth doing a sequence of searches with INSTR on each file to find out whether it has a known header or footer?

The Line Input code is an old timer that performs OK, but there is probably a faster way to do it. I envisaged something like a linear word search with INSTR to find the lead and trailing strings for each page, then grabbing each page with MID$. Alternatively, if it's only particular pages you require, you can use the page numbers to scan the text for the page notation, or if you need multiple pages, create an array of page offsets so you can index your way through them.
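
A minimal sketch of the page-offset idea, assuming the whole cleaned file has already been read into one string (e.g. with load_file) and that every page marker line starts with "Page " as written by the cleaning pass. The function name get_page and its parameters are made up for illustration. Note that wanted is the ordinal of the marker in the file, not the printed page number, and that a plain search for "Page " will also hit that word inside body text, so a real version would probably want a stricter marker.

```powerbasic
' Sketch only: index the "Page " markers, then slice one page out with MID$.
FUNCTION get_page(BYVAL txt AS STRING, BYVAL wanted AS LONG) AS STRING

    LOCAL n   AS LONG                        ' number of markers found
    LOCAL cnt AS LONG                        ' index into the offset array
    LOCAL ps  AS LONG                        ' current search position

    n = TALLY(txt, "Page ")
    IF wanted < 1 OR wanted > n THEN EXIT FUNCTION   ' returns an empty string

    DIM offsets(1 TO n) AS LONG

    ps = INSTR(txt, "Page ")
    DO WHILE ps <> 0
        INCR cnt
        offsets(cnt) = ps                    ' remember where this page starts
        ps = INSTR(ps + 5, txt, "Page ")     ' continue after this marker
    LOOP

    IF wanted < n THEN                       ' slice up to the next marker
        FUNCTION = MID$(txt, offsets(wanted), offsets(wanted + 1) - offsets(wanted))
    ELSE                                     ' last page runs to end of text
        FUNCTION = MID$(txt, offsets(wanted))
    END IF

END FUNCTION
```

If many pages are needed from the same file, it would make more sense to build the offset array once and keep it, rather than rescanning on every call as this sketch does.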

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 31, 2015, 08:15:41 PM

JJ
Thanks for your extra work. It's very much appreciated.
Hutch.
That sounds like very good advice and thank you for it.
I've only got the reports for a handful of companies at the moment, but even these few confirm that I'm up against quite a bit of non-uniformity.

Title: Re: moving thread from pb forums to here because of their server problems
Post by: hutch-- on August 31, 2015, 08:27:10 PM

Ok, I guess the trick is to make a lookup list of easily identifiable keywords or phrases that can identify a particular file layout from a given company. Now if you have multiple similar phrases you could stack the order to try the longer ones first then the shorter ones after it.

1. Annual Report and Accounts
2. Annual Report
etc ....
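
A rough sketch of how such a stacked list might be searched, assuming the file text is already in a string. The function name find_marker is made up for illustration and the two phrases are just the ones listed above. The longer phrase has to come first because "Annual Report" is contained in "Annual Report and Accounts", so testing the short one first would match both layouts.

```powerbasic
' Sketch only: return the first (longest) known phrase found in the text.
FUNCTION find_marker(BYVAL txt AS STRING) AS STRING

    LOCAL i AS LONG

    DIM phrase(1 TO 2) AS STRING
    phrase(1) = "Annual Report and Accounts"     ' longest first
    phrase(2) = "Annual Report"

    FOR i = 1 TO UBOUND(phrase)
        IF INSTR(txt, phrase(i)) <> 0 THEN
            FUNCTION = phrase(i)
            EXIT FUNCTION
        END IF
    NEXT i

    ' falls through with an empty string if nothing matched

END FUNCTION
```

An empty return value would then flag the file as one to "eye-ball" by hand.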

Title: Re: moving thread from pb forums to here because of their server problems
Post by: bobl on August 31, 2015, 10:03:46 PM

>you could stack the order to try the longer ones first then the shorter ones after it
That's a very good point.
Yes I'll do that and once again thanks for the advice.

The MASM Forum

Specialised Projects => PowerBASIC => Topic started by: bobl on August 27, 2015, 11:12:39 PM