moving thread from pb forums to here because of their server problems

Started by bobl, August 27, 2015, 11:12:39 PM


bobl

I just wanted to say that in addition to each token's x and y values I might also need the width and height values (along with the rotation figure to discard e.g. vertical text). That's because the token text snippets that make up the same line don't necessarily have the same y value (which would make grouping them trivial). I'm currently looking at the "base" value to see if I can use it as a proxy for y.

It might be worth holding fire re an asm-based extraction solution until I can verify that
a) those values are what I think they are and
b) I can come up with a fool-proof algorithm to reliably make up the lines.

I'll come back when I've got something and thank you for your patience.
Regards
Dean 

bobl

From my little tests the "base" value seems a perfect alternative to y, so it looks like I can get away with extracting...
1 the PAGE NUMBER, generated by incrementing a variable every time <PAGE is encountered, and...
2 for each token, i.e. the string between <TOKEN and </TOKEN>...
      ROTATION...i.e. don't record anything if it's not zero...
      otherwise, if it is zero, the values for...
      X value
      BASE value
      TEXT value, i.e. the string between > and </TOKEN at the very end of the token

The sameness of the base figure means I can avoid the paper-thin calculations I was once doing, because phrases are either on the same line or they're not, so x and base can now probably be singles instead of doubles. The values seem to go to 4 decimal places.
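
For what it's worth, here is a minimal Python sketch of that grouping, just to pin the idea down. It assumes the attributes are spelled x, base and rotation in lower case and that every TOKEN sits somewhere inside a PAGE element; the sample XML is made up for illustration and is not taken from the real file. Whether a larger base means higher or lower on the page is also an assumption, so the output order may need reversing.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; attribute names and values are assumptions.
SAMPLE = """<DOCUMENT>
<PAGE>
<TOKEN x="120.5000" base="700.2500" rotation="0">operating</TOKEN>
<TOKEN x="72.0000" base="700.2500" rotation="0">Group</TOKEN>
<TOKEN x="30.0000" base="400.0000" rotation="1">sideways</TOKEN>
<TOKEN x="72.0000" base="714.7500" rotation="0">Revenue</TOKEN>
</PAGE>
</DOCUMENT>"""


def lines_by_base(xml_text):
    """Return (page_number, base, line_text) tuples, one per text line."""
    root = ET.fromstring(xml_text)
    result = []
    # The page number is just a counter bumped for every PAGE element.
    for page_no, page in enumerate(root.iter("PAGE"), start=1):
        rows = defaultdict(list)
        for tok in page.iter("TOKEN"):
            # Discard rotated (e.g. vertical) text.
            if float(tok.get("rotation", "0")) != 0:
                continue
            # Tokens on the same line share an identical base string.
            rows[tok.get("base")].append((float(tok.get("x")), tok.text or ""))
        for base in sorted(rows, key=float):
            # Within a line, order the tokens left to right by x.
            words = [text for _, text in sorted(rows[base])]
            result.append((page_no, float(base), " ".join(words)))
    return result


for row in lines_by_base(SAMPLE):
    print(row)
```

Grouping on the exact base string only works while the sameness holds; if some file turns up with slightly different base values on one line, the key would need rounding first.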

hutch--

Dean,

I got your PDF file and at least know what the source was. Is there any reason to convert the PDF to XML? The conversion looks like it is both messy and lossy, which is generating some of the problems parsing the XML file.

bobl

Hutch
That's a very good question.

The only reason I converted from pdf to xml is that I read that you can't extract the data in a comprehensible form from pdfs without first decoding, which I don't have a clue about. If that's not true, and you can get the above fields directly, I'd much prefer that, especially considering that the conversion isn't exactly instantaneous and that the 3rd party software I am relying on is lossy.

Re loss...I did try a number of pdf-to-txt variants but these were even more lossy than the pdf-to-xml route, particularly in terms of structure.

hutch--

I don't think it would be viable to try and directly parse PDF, but I wondered if there is such a thing as a decent PDF to TEXT converter? My Nitro Reader won't do it (it locks up instead), and I don't keep the Adobe version any longer due to the endless upgrades.

dedndave

for single-use, there are likely online converters

from what i remember, there is a library package for working with PDF's
if you can figure out how it plays, shouldn't be hard to write a little app   :P


bobl

It's worth pointing out that my comments re pdf-to-text offerings being lossy only apply to the free ones.
I looked at Debenu's Quick Pdf library (Dave, it might be the one you're thinking of) which gets good reviews from Powerbasic users...
http://www.powerbasic.com/support/pbforums/showthread.php?t=54013.
It's $499. If it works flawlessly and the cheaper ones below don't...it might be worth that.

I did find this today but am not sure if the conversion is pdf-to-text or the other way around
http://convert-pdf-software-review.toptenreviews.com.
Most of them look reasonably priced and therefore worth investigating.
Other than two they all seem to do batch conversion which for 30,000 files would be a must-have.
I use pdf xchange viewer and know they have a pro version so might talk to them too.

Adobe stuff always seems too big and cumbersome to me, and the constant updates Hutch mentioned would irritate me too.

Hutch
Thanks for the searching question which has prompted me to revisit the conversion tools.
Thank you.

Dave and jj2007
Thanks for your responses.
In truth...I'd love to get in there and write my own...but it would take me ages and I've got to finish this app asap. When it's running I'll have the time to "polish" it by doing work like this.

jj2007

Quote from: bobl on August 28, 2015, 10:23:32 PM
It's worth pointing out that my comments re pdf-to-text offerings being lossy only apply to the free ones.
I looked at Debenu's Quick Pdf library (Dave, it might be the one you're thinking of) which gets good reviews from Powerbasic users...
http://www.powerbasic.com/support/pbforums/showthread.php?t=54013.
It's $499. If it works flawlessly and the cheaper ones below don't...it might be worth that.

I once needed this, and used pdf2txt. It worked, but no miracles. The PDF format is very tricky.
Have a look at http://software.informer.com/search/pdf+text+converter
Judging by the number of downloads, it could be a starting point. For example, 170,563 downloads for
"All File to All File Converter 3000", which is shareware but costs 160$ ::)

If your files are similar in format, you might find a cheap or free solution that handles your specific type of files well. The problem is "complicated" documents created with the latest Adobe format.

bobl

Thanks for the information
pdf2txt was what I used and it lost some of the structure.
I just came back to say I've just pointed http://www.somepdf.com/some-pdf-to-txt-converter.html at my parent pdf directory, the sub-directories of which hold the actual pdfs. It crashed a little while after reporting that one of my pdfs was not open to being converted, but it seemed to convert a number of the pdfs quite well and quite quickly, i.e. in batch mode.
Shame about the crash.

>The problem is "complicated" documents created with the latest Adobe format.
I'm guessing a lot of finance directors will want to have the latest Adobe tools.

bobl

"The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course)."

"It worse than this - text need not be laid out on the page in reading order. It need not be laid out rectilinearly.
Pasted from <http://stackoverflow.com/questions/1136990/how-can-i-extract-text-from-a-pdf-file-in-perl> "

That explains a lot!

A glimmer of hope...re pdf-to-text

"PDF2TXT.py This is what I use, although it is Python, it works flawlessly.
http://www.unixuser.org/~euske/python/pdfminer/index.html"

"Yes, pdf2txt.py runs flawlessly ! –  mandy Jul 9 '11 at 11:16"
Pasted from <http://stackoverflow.com/questions/1136990/how-can-i-extract-text-from-a-pdf-file-in-perl>

I installed and ran pdf2txt.py and got
===========================================================
51

CONSOLIDATED INCOME STATEMENT
FOR THE YEAR ENDED 31ST DECEMBER

Revenue
Net operating costs
Group operating profit
===========================================================
i.e. the same corrupt representation of "profit" that pdf2xml is giving, so...not flawless, at least not without possibly setting some flag. By contrast "Some pdf" converts "profit" fine. Shame it keels over when encountering problematic or protected pdfs.
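
If the corruption turns out to be the single "fi" ligature character (U+FB01) coming through in place of the two letters, which is only a guess at this stage, a clean-up pass along these lines might be enough. The function name is made up; unicodedata is in the Python standard library.

```python
import unicodedata


def expand_ligatures(text):
    """Fold compatibility characters such as the fi ligature into plain letters."""
    # NFKC maps U+FB01 (fi) and U+FB02 (fl) to their two-letter forms.
    return unicodedata.normalize("NFKC", text)


print(expand_ligatures("Group operating pro\ufb01t"))
```

Note that NFKC also rewrites other compatibility characters (non-breaking spaces and superscript digits, for example), so it would be worth checking it doesn't disturb any figures before running it over everything.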

bobl

I've been looking at various tools today and have just downloaded and had a go with ghostscript.
After playing with some of the flags I ended up with...

gswin32c -q -dNODISPLAY -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dCOMPLEX -c save -f ps2ascii.ps c:\2013.pdf -c quit >c:\2013.txt

...it's just produced the attached file which certainly looks very "uniform".
I'm just in the process of assembling it into lines to see how faithfully it reflects the original.
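
In case it's useful for the 30,000-file batch, here is a rough Python sketch of driving that same command over a directory tree. The flags are copied as-is from the line above; the function names, the default executable name and the idea of writing each .txt next to its .pdf are my own assumptions. It records failures and carries on rather than stopping at the first problem pdf.

```python
import subprocess
from pathlib import Path

# Flags copied verbatim from the command line used above.
GS_FLAGS = ["-q", "-dNODISPLAY", "-dSAFER", "-dDELAYBIND",
            "-dWRITESYSTEMDICT", "-dCOMPLEX"]


def gs_command(pdf_path, gs_exe="gswin32c"):
    """Build the argument list for one pdf (output goes to stdout)."""
    return [gs_exe, *GS_FLAGS, "-c", "save", "-f", "ps2ascii.ps",
            str(pdf_path), "-c", "quit"]


def convert_tree(root_dir, gs_exe="gswin32c"):
    """Convert every pdf under root_dir; return a list of (pdf, exit code) failures."""
    failures = []
    for pdf in sorted(Path(root_dir).rglob("*.pdf")):
        out_path = pdf.with_suffix(".txt")
        # Redirect stdout to the .txt file, as ">" does on the command line.
        with open(out_path, "wb") as out:
            result = subprocess.run(gs_command(pdf, gs_exe),
                                    stdout=out, stderr=subprocess.PIPE)
        if result.returncode != 0:
            failures.append((pdf, result.returncode))
    return failures


print(gs_command(r"c:\2013.pdf"))
```

Calling convert_tree on the parent pdf directory would need gswin32c on the PATH; I have only sketched it, not run it against the real files.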

jj2007

Looks horrible :biggrin:

What is it? A database? Without knowing the original pdf, it's difficult to understand. What do you want to do with it, extract the data, or translate to a spreadsheet??

hutch--

Dean,

That looks like it can be parsed OK. Just grab the character(s) inside the brackets, the "(\243)" looks like the line breaks.

LATER:

This is the first try and the ghostscript format looks like it is consistent to parse.


' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    #include "\basic\include\win32api.inc"

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FUNCTION PBmain as LONG

    LOCAL lft as DWORD
    LOCAL rgt as DWORD

    editor$ = "\basic\qeditor.exe"

    Open "ghostscript.txt" for Input as #1
    Open "output.txt" for Output as #2

  ' ***************
  ' line processing
  ' ***************
    Do
      Line Input #1, a$

      If instr(a$,"\243") <> 0 Then             ' line break character
        Print #2, chr$(13,10);
        ! jmp outlbl
      End if

      Select Case as CONST$ left$(a$,1)
        Case "S"                                ' normal text character
          lft = instr(a$,"(")+1
          rgt = instr(a$,")")
          Print #2, mid$(a$,lft,rgt-lft);

        Case "P"
          Print #2, "--------"+chr$(13,10);     ' show page breaks

        Case "F"
          Print #2, chr$(13,10)+a$              ' font line

        Case "C"
          Print #2, chr$(13,10)+a$

        Case "R"
          Print #2, chr$(13,10)+a$

        Case Else
          Print #2, a$                          ' default

      End Select

    outlbl:

    Loop while not eof(1)

    Close #2
    Close #1

  ' *****************
  ' global processing
  ' *****************
    src$ = load_file("output.txt")

    src$ = block_ltrim(src$)
    Replace chr$(13,10,13,10) with chr$(13,10) in src$
    save_file("output.txt",src$)

  ' *****************

    a& = shell(editor$+" output.txt",1)

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    DECLARE FUNCTION open_file_A LIB "KERNEL32.DLL" ALIAS "CreateFileA" ( _
                     ByVal lpFileName AS DWORD, _
                     ByVal dwDesiredAccess AS DWORD, _
                     ByVal dwShareMode AS DWORD, _
                     ByVal lpSecurityAttributes AS DWORD, _
                     ByVal dwCreationDisposition AS DWORD, _
                     ByVal dwFlagsAndAttributes AS DWORD, _
                     BYVAL hTemplateFile AS DWORD) AS DWORD

    DECLARE FUNCTION get_file_size LIB "KERNEL32.DLL" ALIAS "GetFileSize" ( _
                     BYVAL hFile AS DWORD, lpFileSizeHigh AS LONG) AS LONG

    DECLARE FUNCTION file__read LIB "KERNEL32.DLL" ALIAS "ReadFile" ( _
                     BYVAL hFile AS DWORD,ByVal pbuff as DWORD, BYVAL nNumberOfBytesToRead AS DWORD, _
                     ByVal lpNumberOfBytesRead AS DWORD, ByVal lpOverlapped AS DWORD) AS LONG

    DECLARE FUNCTION closefh LIB "KERNEL32.DLL" ALIAS "CloseHandle" ( _
                     BYVAL hObject AS DWORD) AS LONG

' ------------------------------------

FUNCTION load_file(fname$) as STRING

    LOCAL hFile as DWORD
    LOCAL flen  as DWORD
    LOCAL pdat  as DWORD         ' string pointer
    LOCAL bred  as DWORD         ' bytes read variable

    hFile = open_file_A(StrPtr(fname$),&H80000000& or &H40000000&,0,0,3,&H00000080,0)
    flen  = get_file_size(hFile,0)

    buffer$ = nul$(flen)
    pdat = StrPtr(buffer$)

    file__read(hFile,pdat,flen,VarPtr(bred),0)

    closefh hFile

    FUNCTION = buffer$

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

    DECLARE FUNCTION fcreate_A LIB "KERNEL32.DLL" ALIAS "CreateFileA" ( _
                     ByVal lpFileName AS DWORD, _
                     ByVal dwDesiredAccess AS DWORD, _
                     ByVal dwShareMode AS DWORD, _
                     ByVal lpSecurityAttributes AS DWORD, _
                     ByVal dwCreationDisposition AS DWORD, _
                     ByVal dwFlagsAndAttributes AS DWORD, _
                     BYVAL hTemplateFile AS DWORD) AS DWORD

    DECLARE FUNCTION file__write LIB "KERNEL32.DLL" ALIAS "WriteFile" ( _
                     BYVAL hFile AS DWORD,ByVal lpBuffer AS DWORD, _
                     BYVAL nNumberOfBytesToWrite AS DWORD, _
                     ByVal NumberOfBytesWritten AS DWORD,ByVal lpOverlapped AS DWORD) AS DWORD

    DECLARE FUNCTION fh_close LIB "KERNEL32.DLL" ALIAS "CloseHandle" ( _
                     BYVAL hObject AS DWORD) AS LONG

' ------------------------------------------

FUNCTION save_file(fname$,src$) as DWORD

    LOCAL hFile as DWORD
    LOCAL pdat as DWORD         ' string pointer
    LOCAL ldat as DWORD         ' data length
    LOCAL bwrt as DWORD         ' bytes written variable

    hFile = fcreate_A(StrPtr(fname$),&H40000000&,0,0,2,&H00000080,0)

    pdat = StrPtr(src$)         ' get string address
    ! mov eax, pdat
    ! mov eax, [eax-4]          ' get string length
    ! mov ldat, eax

    file__write(hFile,pdat,ldat,VarPtr(bwrt),0)

    fh_close hFile

    FUNCTION = bwrt

End FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

FUNCTION block_ltrim(txt$) as STRING

  ' ----------------------------------------------
  ' trim leading tabs and spaces on multiple lines
  ' of source and overwrite it with the results.
  ' ----------------------------------------------

    #REGISTER NONE

    LOCAL src as DWORD

    src = StrPtr(txt$)

    ! mov edx, src              ' source in EDX
    ! mov ecx, src              ' same address as target
    ! sub edx, 1

  #align 4
  trimleft:
    ! add edx, 1
    ! cmp BYTE PTR [edx], 32    ' loop back on space
    ! je trimleft
    ! cmp BYTE PTR [edx], 9     ' loop back on tab
    ! je trimleft
    ! sub edx, 1

  #align 4
  store:
    ! add edx, 1
    ! movzx eax, BYTE PTR [edx] ' copy byte
    ! mov [ecx], al
    ! add ecx, 1
    ! test al, al               ' test for written terminator
    ! jz bl_out                 ' exit on terminator
    ! sub al, 10                ' test for ascii 10
    ! jnz store
    ! jmp trimleft

  bl_out:
    ! sub ecx, src              ' sub src from ecx
    ! mov src, ecx              ' write ecx back to src (src reuse)
    FUNCTION = left$(txt$,src)  ' return basic string

END FUNCTION

' ¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤¤

hutch--

I eventually got Nitro Reader to convert the original PDF to text, and it is a lot cleaner than the conversions we have tried so far. What makes it messy is that it contains control codes that I have yet to find a set of conversions for. You can guess at some of them and clean up the results a bit; they look like old octal notation but don't all work the same way. I could directly replace the UK pound symbol, but at least one of the control codes is 3 characters that convert to the characters "fi", so I don't yet know how to properly convert the data.
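
A possible way to handle those codes, sketched in Python rather than PB so it is quick to try. Octal 243 is decimal 163, which is the pound sign in Latin-1, so a plain octal-to-byte decode covers that case. Codes that don't follow Latin-1 need a per-file override table. The 256 -> "fi" entry below is where Adobe's StandardEncoding puts that ligature, but whether this particular file uses that encoding is an assumption, and the function name is made up.

```python
import re

# A backslash followed by three octal digits, limited to byte range (000-377).
OCTAL_ESCAPE = re.compile(r"\\([0-3][0-7]{2})")


def decode_octal(text, overrides=None, encoding="latin-1"):
    """Replace \\ooo escapes with characters, consulting overrides first."""
    overrides = overrides or {}

    def repl(match):
        code = int(match.group(1), 8)
        if code in overrides:
            return overrides[code]
        # Fall back to treating the code as a single byte in `encoding`.
        return bytes([code]).decode(encoding)

    return OCTAL_ESCAPE.sub(repl, text)


print(decode_octal(r"\243100"))
# Hypothetical override: 0o256 taken to mean the fi ligature.
print(decode_octal(r"Group operating pro\256t", {0o256: "fi"}))
```

Without the override the same code decodes to a different Latin-1 character, which may be what "don't all work the same way" looks like in practice: the right table depends on the font encoding used inside the pdf.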