Moving thread from PB forums to here because of their server problems

Started by bobl, August 27, 2015, 11:12:39 PM


bobl

jj
The original is a PDF, but it's too big to post here, so I've attached a screenshot of one page.
Hutch
Thanks for the "(\243)" line break advice.
Last night it took me a few goes to break the file... with $LF.
I'm just seeing \243 now.
Re those control codes... there are a lot more pitfalls in doing this than I had expected.
Thank you very much indeed for your work.
I'm still parsing at the moment.

jj2007

If the screenshot corresponded to a text file, we would have a chance to understand it. Could you post the text file that was converted from this PDF?

hutch--

Dean,

What I am interested in is what you need to get out of the PDF file. There is a lot of information in it: image data, tables and text data, and a lot of formatting as well.

From the original PDF file converted to text with Nitro Reader I can get this type of data with some replacements and further massaging.


52
Group plc
Annual Report and Accounts 2013
CONSOLIDATED STATEMENT OF COMPREHENSIVE INCOME
FOR THE YEAR ENDED 31ST DECEMBER
Notes
2013
£m
2012
Restated
(note 6)
£m
Profit for the year 0.7 1.5
Other comprehensive income/(losses) to be reclassified to profit or loss in subsequent periods:
Exchange adjustments on hedge of net investments 19 0.6
Exchange differences on translation of foreign operations 0.4 (2.8) Tax on items taken to reserves (0.4)
Net other comprehensive income/(losses) to be reclassified to profit or loss in
subsequent periods (2.2)
Items not to be reclassified to profit or loss in subsequent periods:
Actuarial gains/(losses) on defined benefit pension schemes 6 2.3 (32.7) Tax on items taken to reserves 8 (2.3) 7.4
Net other comprehensive losses not to be reclassified to profit or loss in subsequent
periods (25.3)
Other comprehensive losses for the year (27.5)
Total comprehensive income/(losses) for the year 0.7 (26.0)
Attributable to:
Equity holders of the parent 0.8 (25.9) Non-controlling interests (0.1) (0.1)
Total comprehensive income/(losses) for the year 0.7 (26.0)
Notes to the accounts are on pages 57 to 82.


PS: The notation \243 is the UK pound symbol.

bobl

I just came back to ask if you've discovered what the page delimiter is, i.e. I've got some output but characters are missing. I foolishly tried the "P" on its own line.

JJ
I was intending to post the converted page to go with that pdf page and that's what I'm trying to find by parsing the file.

Hutch
Yes, that's a very good question.

The short answer is the lines in jj's screenshot above your post, for the income statement, balance sheet and cash flow statement... and preferably the consolidated versions of these statements, if present.
Ideally the lines start with a textual title and end in two numbers: this year's and last year's.
The statements are in very close proximity, i.e. often a page each and consecutive.

Here's an example
Revenue                    1111.11           2222.22
or
Revenue        3           1111.11           2222.22

and I'd like to create the following table.
revenue | 1111.11 | 2222.22
cogs       | 33.11     | 44.22

Unfortunately, it's not uncommon for the items to deviate from this ideal format (see the balance sheet), and automatically correcting them will take another level of parsing.

At this stage getting the right pages (or all of them) with the lines intact would be great.

Hope that explains and thank you for your help...if not please let me know.

bobl

I read in some university's PDF-to-text tutorial that the ps2ascii utility that comes with Ghostscript is not very robust. By contrast, Ghostscript itself was deemed "wonderful". To rule out the former, I tried this...

gswin32c -dNOPAUSE -sDEVICE=txtwrite -dFirstPage=54 -dLastPage=57 -sOutputFile=output.txt -q 2013.pdf -c quit

and got the attached file of the "statements of interest".

I also looked into the "corrupted" "fi" in "profit" and found the following explanation...

"
fi and fl are character codes 174 and 175 in Adobe StandardEncoding.  Character codes 174 and 175 are "registered" and "macron" in ISO Latin 1 and encodings derived therefrom (such as Windows ANSI, LY1 etc.).  So somehow you have old TFMs lying around set up for Adobe StandardEncoding.
Pasted from <http://www.verycomputer.com/18_7c47d859451ce7d4_1.htm>
"



jj2007

What is this?
          Current tax assets           â€"          1.1        1.0

What is the difference - they both seem "fi":
ProÂ"t before tax
..
Net finance costs

I've given it a first shot (attached). It looks almost convincing in Excel. One problem is that sometimes there is a note, sometimes not (e.g. Profit before tax):
Revenue 3 250.4 244.6
Net operating costs 4 (242.2) (238.1)
Group operating profit 4 8.2 6.5
Pension charge 6 (3.5) (2.9)
Non-recurring costs 4 (2.2) (1.7)
Profit before finance income/(costs) and tax 2.5 1.9
Finance income 7 0.1 0.4
Finance costs 7 (1.5) (0.6)
Profit before tax 1.1 1.7
Tax expense 8 (0.4) (0.2)
Profit for the year 0.7 1.5
Profit attributable to:
Equity holders of the parent 0.8 1.6
Non-controlling interests (0.1) (0.1)
Profit for the year 0.7 1.5


Source:
include \masm32\MasmBasic\MasmBasic.inc
  Init
  Let esi=CL$() ; you may pass a textfile in the command line
  .if Len(esi)==1
    Let esi="Output.txt"
  .endif
  Let esi=FileRead$(esi)
  Let esi=Replace$(esi, "Â"", "fi")
  Let esi=Replace$(esi, "fi", "fi")
  Let esi=Replace$(esi, "ï¬,", "fl")
  Let esi=Replace$(esi, "£", "Ps") ; pound sterling
  Let esi=Replace$(esi, "â€"", "xxx€") ; Euros?
  mov ecx, Len(esi)
  Let edi=New$(ecx)
  push esi
  push edi
  add ecx, esi ; startpos+Len(esi), marks end of parsing on stack
  push ecx
  .Repeat
    lodsb
    .if al==32
      .if dword ptr [esi]==20202020h
        .Repeat
          lodsb
        .Until al!=32
        dec esi
        mov al, 9 ; one tab
      .endif
    .elseif al==13 ; CrLf
      stosb
      movsb
      .Repeat
        lodsb
      .Until al>32
    .endif
    stosb
  .Until esi>=stack
  pop ecx
  pop edi
  pop esi
  FileWrite "Output.tab", edi
  sub ecx, esi
  Print Str$("%i bytes processed", ecx)
  Exit
EndOfCode
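
For anyone who wants to try the column step without MasmBasic, here is a rough sketch of the same idea in Python (not a language used in this thread, and not a line-for-line translation). It assumes a "column gap" is a run of five or more spaces, which is what the loop above tests for, and the function name `columns_to_tabs` is made up for illustration. Unlike the assembly loop, it keeps blank lines.

```python
import re

def columns_to_tabs(text: str, min_run: int = 5) -> str:
    """Replace each run of `min_run` or more spaces with a single tab.

    Leading whitespace on every line is dropped first, so indented
    statement lines start in the first column.
    """
    wide_gap = re.compile(r" {%d,}" % min_run)
    cleaned = []
    for line in text.splitlines():
        cleaned.append(wide_gap.sub("\t", line.lstrip()))
    return "\n".join(cleaned)

if __name__ == "__main__":
    sample = "Revenue" + " " * 8 + "3" + " " * 11 + "1111.11" + " " * 11 + "2222.22"
    print(columns_to_tabs(sample).split("\t"))
```

Shorter runs of spaces are left alone, so a title such as "Net finance costs" stays in one cell.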

bobl

JJ
I just came back to post this file cleaner....

#COMPILE EXE
'#DIM ALL

#INCLUDE "win32api.inc"

FUNCTION clean_ln(ln$) AS STRING
   new_ln$ = ln$
   REPLACE "Â"" WITH "fi" IN new_ln$
   REPLACE "fi" WITH "fi" IN new_ln$
   REPLACE "£" WITH "£" IN new_ln$
   REPLACE "â€"" WITH "-" IN new_ln$
   REPLACE "ï¬," WITH "fl" IN new_ln$
   FUNCTION = new_ln$
END FUNCTION


FUNCTION PBMAIN () AS LONG
   editor$="c:\qedit35\qeditor"
    OPEN "output.txt" FOR INPUT AS #1
    OPEN "output2.txt" FOR OUTPUT AS #2
      WHILE ISFALSE(EOF(#1))
         LINE INPUT #1,ln$
         ln$=clean_ln(ln$)
         PRINT #2,ln$
      WEND
    CLOSE #2
    CLOSE #1
   a& = SHELL(editor$+" output2.txt",1)
END FUNCTION


....am reading your post now. Thank you for it.

Later...
Thanks for the masm code...I need that!

>sometimes there is a note
Well spotted... and yes, it's optional. The way to deal with that and the title is to count inwards from both sides,
i.e.
for the title...
Go in from the left, moving rightwards until you hit a number, and keep going until you see a letter after it. If you don't, then the character before that first number is the end of the title.
for the numbers...
From the right, move leftwards two non-space groupings and those are your "this year's" and "last year's" figures. The problem is that authors often add more than two numbers, a lot more than I'd like, which is tricky to deal with. It will probably involve trying to identify the title fields and counting the fields in from the right until you see just the current year on its own, i.e. 2015, then counting that many non-space groups in. If I can't get the title column, or it isn't plain "2015", then I'll have to pull the figures off manually. I do have a very good tool for this called Monarch Pro, but it doesn't like this 2013 PDF, whereas everything else I have opens it fine. Probably because it's a few years old now and, as you pointed out, the specs change.
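
The "count in from the right" idea for the simple two-figure case could look something like the sketch below. It is in Python rather than the PowerBASIC or MasmBasic used in this thread, and `split_statement_line` and the `NUMBER` pattern are names invented for the example. It assumes figures look like 250.4, (242.2) or 1,234.5, and that a bare integer sitting just before the two figures is a note reference.

```python
import re

# A figure: optional brackets or minus sign, digits with optional commas,
# optional decimal part. "to:" or "82." do not match.
NUMBER = re.compile(r"^\(?-?[\d,]*\.?\d+\)?$")

def split_statement_line(line):
    """Return (title, note, this_year, last_year), or None if the line
    does not end in two numeric groups."""
    tokens = line.split()
    if len(tokens) < 3:
        return None
    if not (NUMBER.match(tokens[-2]) and NUMBER.match(tokens[-1])):
        return None
    this_year, last_year = tokens[-2], tokens[-1]
    rest = tokens[:-2]
    note = None
    # An integer directly before the figures is taken to be the note number.
    if len(rest) > 1 and rest[-1].isdigit():
        note = rest.pop()
    return " ".join(rest), note, this_year, last_year

if __name__ == "__main__":
    for text in ("Revenue 3 250.4 244.6", "Profit before tax 1.1 1.7"):
        title, note, this_year, last_year = split_statement_line(text)
        print(" | ".join([title.lower(), this_year, last_year]))
```

It will get things wrong whenever a line carries only one figure next to a note number, has more than two numeric columns, or uses a dash for a nil value, so those lines would still need the extra level of parsing (or eyeballing) described above.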

Thanks for your program. I ran it and lotus opened it fine (when I changed the extension to .txt)

jj2007

Quote from: bobl on August 30, 2015, 12:20:30 AM
Thanks for your program. I ran it and lotus opened it fine (when I changed the extension to .txt)

My pleasure. I attach a new version that handles the columns correctly and opens output.tab in Excel if Excel is running.

hutch--


   REPLACE "Â"" WITH "fi" IN new_ln$
   REPLACE "fi" WITH "fi" IN new_ln$
   REPLACE "£" WITH "£" IN new_ln$
   REPLACE "â€"" WITH "-" IN new_ln$
   REPLACE "ï¬," WITH "fl" IN new_ln$


This is the type of stuff we need to clean up the converted file. If we can get a full list of these on standard control codes, parsing the XML file would be a lot easier.
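
On getting a "full list": sequences such as "Â£" are what UTF-8 bytes look like when a viewer shows them as Windows ANSI (the pound sign is bytes C2 A3 in UTF-8, which display as "Â" and "£"). If the converted file really is UTF-8, which is an assumption based on those patterns, then decoding it as UTF-8 and letting Unicode normalisation expand the ligatures covers fi, fl, ff and ffi in one go without a hand-made table. A sketch in Python (not a language used in this thread); `clean_text` and the `PUNCTUATION` table are made-up names.

```python
import unicodedata

# Dashes and curly quotes mapped to plain ASCII, as in the REPLACE list.
PUNCTUATION = str.maketrans({
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u2018": "'",   # left single quote
    "\u2019": "'",   # right single quote
})

def clean_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, expand ligatures, and simplify punctuation."""
    # Bytes that are not valid UTF-8 become U+FFFD instead of raising.
    text = raw.decode("utf-8", errors="replace")
    # NFKC replaces compatibility characters such as the "fi" ligature
    # (U+FB01) with their plain-letter equivalents.
    text = unicodedata.normalize("NFKC", text)
    return text.translate(PUNCTUATION)

if __name__ == "__main__":
    raw = "Pro\ufb01t \u2013 \u00a3m".encode("utf-8")
    print(clean_text(raw))
```

NFKC also rewrites other compatibility characters (for example a non-breaking space becomes an ordinary space), which is usually harmless for this kind of table but worth knowing about.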

jj2007

What is the "â€""?
New version attached above, before Hutch's post.

hutch--

I think the "â€"" is the same as a bullet mark in Word. A "-" seems to do the job and it reads OK.

bobl

Thanks JJ and Hutch
I don't know if this is relevant
// Adobe Standard Encoding table for ttf2pt1
// Thomas Henlich <Thomas.Henlich@mailbox.tu-dresden.de>
Pasted from <http://get-software.net/fonts/utilities/ttf2pt1/maps/adobe-standard-encoding.map>

I've attached the "whole" file. Looking through it, the "ff" in "efficiency" is missing, but in other places there are funny character sequences for it. In the attached program I just included the whole word for conversion.

One thing about the output... is pages. There aren't any :) . I managed to write the textual output to one page per file (see the %%d in the output file name, which I accidentally left in the attached batch file; the percent is doubled so as not to conflict with DOS's %).

I'm still working on how to write page separators to the single output file.
The file size is now about 500k and not the 12-13MB Hutch originally questioned, which is what caused this switch; given the 24x reduction, that was a very good call.
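
One possible workaround for the page separators, sketched in Python (not a language used in this thread): keep the one-file-per-page output and stitch the files back together with a marker line in front of each page. The marker text, the function name `join_pages` and the `first_page` parameter are all invented for the example, and it assumes the page files are UTF-8 text.

```python
from pathlib import Path

def join_pages(page_files, out_file, first_page=1):
    """Concatenate per-page text files, writing a marker line before each.

    `page_files` must already be in page order; `first_page` is the page
    number given to the first file.
    """
    with open(out_file, "w", encoding="utf-8") as out:
        for number, name in enumerate(page_files, start=first_page):
            out.write(f"===== Page {number} =====\n")
            text = Path(name).read_text(encoding="utf-8", errors="replace")
            # Make sure every page ends with exactly one newline.
            out.write(text.rstrip("\n") + "\n")
```

The numbers written here are positions in the PDF, not the page numbers printed on the report, so they can differ from the "Annual Report and Accounts" footer numbers.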

bobl

That's interesting....
http://ghostscript.com/doc/7.07/Use.htm#PDF_switches
Ghostscript always expects the first line of a PDF to start with %PDF, e.g. %PDF-1.2.
Looking at 2013.pdf, it's a %PDF-1.5, so that narrows down a bit what we're looking at re code conversions.
The advice also says that if there's a problem converting, you can ditch any rubbish before %PDF and it might then work, i.e. Ghostscript expects this as a starting point whereas e.g. Adobe's viewer is more liberal.
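
Ditching the rubbish before %PDF is easy to automate. A minimal Python sketch (not a language used in this thread, and `strip_pdf_preamble` is a made-up name): it assumes the whole file fits in memory and that the first "%PDF-" it finds is the real header.

```python
def strip_pdf_preamble(data: bytes) -> bytes:
    """Return `data` starting at the first %PDF- header.

    Raises ValueError if no header is present at all.
    """
    start = data.find(b"%PDF-")
    if start == -1:
        raise ValueError("no %PDF- header found")
    return data[start:]

if __name__ == "__main__":
    damaged = b"junk from a mail gateway\r\n%PDF-1.5\n...rest of file..."
    print(strip_pdf_preamble(damaged)[:8])
```

Read the file with `open(name, "rb")` and write the result with `"wb"` so that no newline or character translation touches the binary data.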

hutch--

Hi Dean,

This is a much better result. I built your cleaner and ran it on the test file and it worked fine. Now the question is: is the resulting output suitable for parsing out the data you need?

LATER:

RE: The page numbers, does this properly identify the pages in the file? I have used your set of character replacements, then parsed each line that contains "Annual Report and Accounts", moving the page number to the left when it is aligned right.


' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    #include "\basic\include\win32api.inc"

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FUNCTION PBmain as LONG

    editor$ = "\basic\qeditor.exe"

  ' -----------------
  ' GLOBAL operations
  ' -----------------
    src$ = load_file("2013.txt")
    Replace "Â"" with "fi" in src$
    Replace "fi" with "fi" in src$
    Replace "£" with "£" in src$
    Replace "â€"" with "-" in src$
    Replace "ï¬," with "fl" in src$
    Replace "’" with "'" in src$
    Replace "‘" with "'" in src$
    Replace "" with "ff" in src$
    Replace "eicency" with "efficiency" in src$
    save_file("temp.txt",src$)

  ' ----------------------
  ' single line operations
  ' ----------------------
    Open "temp.txt" for Input as #1
    Open "cleaned.txt" for Output as #2

    Do
      Line Input #1, a$

      If instr(a$,"Annual Report and Accounts") <> 0 Then
        a$ = remove$(a$,"2013")
        a$ = monospace$(a$)

      ' -------------------------------------
      ' get page number from right sided page
      ' -------------------------------------
        If left$(a$,1) = "A" Then
          numb$ = ltrim$(right$(a$,2))
          a$ = "Page " + numb$ + " Annual Report and Accounts"
          Print #2, a$
          ! jmp bypass
        End If

      ' -------------------------------------
      ' output the left side page number line
      ' -------------------------------------
        a$ = "Page " + a$
        Print #2, a$
        ! jmp bypass

      End If

    ' ---------------------
    ' print any other lines
    ' ---------------------
      Print #2, a$

    bypass:

    Loop while not eof(1)

    Close #2
    Close #1

    x& = shell(editor$+" cleaned.txt",1)

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    DECLARE FUNCTION open_file_A LIB "KERNEL32.DLL" ALIAS "CreateFileA" ( _
                     ByVal lpFileName AS DWORD, _
                     ByVal dwDesiredAccess AS DWORD, _
                     ByVal dwShareMode AS DWORD, _
                     ByVal lpSecurityAttributes AS DWORD, _
                     ByVal dwCreationDisposition AS DWORD, _
                     ByVal dwFlagsAndAttributes AS DWORD, _
                     BYVAL hTemplateFile AS DWORD) AS DWORD

    DECLARE FUNCTION get_file_size LIB "KERNEL32.DLL" ALIAS "GetFileSize" ( _
                     BYVAL hFile AS DWORD, lpFileSizeHigh AS LONG) AS LONG

    DECLARE FUNCTION file__read LIB "KERNEL32.DLL" ALIAS "ReadFile" ( _
                     BYVAL hFile AS DWORD,ByVal pbuff as DWORD, BYVAL nNumberOfBytesToRead AS DWORD, _
                     ByVal lpNumberOfBytesRead AS DWORD, ByVal lpOverlapped AS DWORD) AS LONG

    DECLARE FUNCTION closefh LIB "KERNEL32.DLL" ALIAS "CloseHandle" ( _
                     BYVAL hObject AS DWORD) AS LONG

' ------------------------------------

FUNCTION load_file(fname$) as STRING

    LOCAL hFile as DWORD
    LOCAL flen  as DWORD
    LOCAL pdat  as DWORD         ' string pointer
    LOCAL bred  as DWORD         ' bytes read variable

    hFile = open_file_A(StrPtr(fname$),&H80000000& or &H40000000&,0,0,3,&H00000080,0)
    flen  = get_file_size(hFile,0)

    buffer$ = nul$(flen)
    pdat = StrPtr(buffer$)

    file__read(hFile,pdat,flen,VarPtr(bred),0)

    closefh hFile

    FUNCTION = buffer$

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    DECLARE FUNCTION fcreate_A LIB "KERNEL32.DLL" ALIAS "CreateFileA" ( _
                     ByVal lpFileName AS DWORD, _
                     ByVal dwDesiredAccess AS DWORD, _
                     ByVal dwShareMode AS DWORD, _
                     ByVal lpSecurityAttributes AS DWORD, _
                     ByVal dwCreationDisposition AS DWORD, _
                     ByVal dwFlagsAndAttributes AS DWORD, _
                     BYVAL hTemplateFile AS DWORD) AS DWORD

    DECLARE FUNCTION file__write LIB "KERNEL32.DLL" ALIAS "WriteFile" ( _
                     BYVAL hFile AS DWORD,ByVal lpBuffer AS DWORD, _
                     BYVAL nNumberOfBytesToWrite AS DWORD, _
                     ByVal NumberOfBytesWritten AS DWORD,ByVal lpOverlapped AS DWORD) AS DWORD

    DECLARE FUNCTION fh_close LIB "KERNEL32.DLL" ALIAS "CloseHandle" ( _
                     BYVAL hObject AS DWORD) AS LONG

' ------------------------------------------

FUNCTION save_file(fname$,src$) as DWORD

    LOCAL hFile as DWORD
    LOCAL pdat as DWORD         ' string pointer
    LOCAl ldat as DWORD         ' data length
    LOCAL bwrt as DWORD         ' bytes written variable

    hFile = fcreate_A(StrPtr(fname$),&H40000000&,0,0,2,&H00000080,0)

    pdat = StrPtr(src$)         ' get string address
    ! mov eax, pdat
    ! mov eax, [eax-4]          ' get string length
    ! mov ldat, eax

    file__write(hFile,pdat,ldat,VarPtr(bwrt),0)

    fh_close hFile

    FUNCTION = bwrt

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FUNCTION monospace$(src$)

  ' ---------------------------------------------------------------------------
  ' left and right trim string, replace tabs with spaces and set single spacing
  ' ---------------------------------------------------------------------------
    #REGISTER NONE

    LOCAL pst as DWORD
    LOCAL dst$

    dst$ = src$                         ' work on copy

    pst = StrPtr(dst$)

    ! mov esi, pst
    ! sub esi, 1
    ! mov edi, pst

  trm:                                  ' trim leading tabs and spaces
    ! add esi, 1
    ! movzx eax, BYTE PTR [esi]
    ! cmp eax, 32
    ! je trm
    ! cmp eax, 9
    ! je trm

    ! sub esi, 1
    ! or ebx, -1                        ' set EBX non zero so it falls through the 1st TEST

  ' =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  #align 4
  pre:
    ! test ebx, ebx                     ' test for zero AFTER its written.
    ! jz pastit

  stlp:
    ! add esi, 1
    ! movzx ebx, BYTE PTR [esi]
    ! cmp ebx, 9
    ! jne nxt1
    ! mov ebx, 32                       ' replace tabs with spaces

  nxt1:
    ! cmp ebx, 32
    ! jne nxt2
    ! movzx eax, BYTE PTR [esi+1]       ' test for following tab or space
    ! cmp eax, 32
    ! je pre
    ! cmp eax, 9
    ! je pre

  nxt2:
    ! mov [edi], bl                     ' write acceptable character
    ! add edi, 1
    ! test ebx, ebx                     ' test for zero AFTER its written.
    ! jnz stlp

  ' =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

  pastit:
    ! sub edi, 1
    ! cmp BYTE PTR [edi-1], 32          ' if last character is a space
    ! jne nxt3
    ! sub edi, 1

  nxt3:
    ! sub edi, pst                      ' length in EDI
    ! mov pst, edi

    FUNCTION = left$(dst$,pst)

END FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

bobl

Hutch
Yes, the data is all perfectly accessible in this file now, and whilst not all reports have "Annual Report and Accounts", most seem to have "Annual Report". Sometimes it's in a footer, but because I don't need the first page it doesn't matter. I have seen "Annual Report" printed vertically on some reports, so I'm not sure where that splits the page. I'm expecting to have to "eye-ball" some files for other reasons anyway, e.g. too many numeric columns, albeit with automated assistance, so it's not a problem.

All in all this seems a much better solution than the XML one I was pursuing, so... thank you very much for steering me towards this one.