Parsing PDF for Fun And Profit (indeed in Python)

PDF documents are ubiquitous in today’s world. Apart from the common use cases of printing and viewing, we sometimes need to do something more specific with them, such as converting them to other formats or extracting their textual content. Extracting text from a PDF document can be a (surprisingly) hard task because of the purpose and design of the format. PDF is intended to give an exact visual representation of a document’s pages, down to the smallest detail, and the internal representation of the text follows this goal. Rather than storing text in logical units (lines, paragraphs, columns, tables, …), PDF represents it as a series of commands that print characters (a single character, a word, part of a line, …) at an exact position on the page with a given font, font size, color, etc.

To reconstruct the original logical structure of the text, a program has to scan all these commands and join together the pieces that probably formed the same line or the same paragraph. This task can be demanding and ambiguous, because the relative position of text boxes can be interpreted in various ways (is this space between words so large because they are in different columns, or because the line is justified?).
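To make the reconstruction problem concrete, here is a minimal sketch of joining positioned characters back into lines. It is not how any particular library does it: the `Glyph` class, the `group_into_lines` function and both thresholds are assumptions made up for illustration, and real layout analysis has to cope with columns, rotated text and much more.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned piece of text (hypothetical structure for this sketch)."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline, measured from the page bottom


def group_into_lines(glyphs, y_tolerance=2.0, gap_factor=0.3):
    """Join glyphs into text lines using only their positions.

    Glyphs whose baseline is within y_tolerance of the first glyph of a
    line are put on that line. A horizontal gap wider than gap_factor
    times the average glyph width on the line is emitted as a space.
    Both thresholds are guesses, which is exactly where the ambiguity lies.
    """
    lines = []  # each entry is a list of glyphs sharing a baseline
    for g in sorted(glyphs, key=lambda g: -g.y):  # top of page first
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x0)  # left to right within the line
        avg_width = sum(g.x1 - g.x0 for g in line) / len(line)
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            if cur.x0 - prev.x1 > gap_factor * avg_width:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return result
```

Changing `gap_factor` changes where words are split, so the same PDF can legitimately yield different text depending on the heuristics chosen.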

So text extraction looks like a discouraging task to attempt. Luckily, some smart people have already tackled it and left us with libraries that do a pretty good job, and we can leverage them. Some time ago I created a tool called PDF Checker, which analyzes the content of a PDF document (presence or absence of certain phrases, paragraph numbering, footer format, etc.). For it I used the excellent Python library PDFMiner. PDFMiner is a great tool and quite flexible, but being written entirely in Python it is rather slow. Recently I have been looking for alternatives that have Python bindings and provide functionality similar to PDFMiner. In this article I describe some results of this search, particularly my experience with libpoppler.

Requirements

To analyze the text of a PDF document in detail, I need to:

  • get number of pages in PDF document and for each page its size
  • extract text from the page, ideally grouped to lines and paragraphs (boxes)
  • get bounding boxes of text items (down to individual characters) to analyze text based on its position on the page: header/footer, indentation, columns, etc.
  • get font name, size and color, and background color, to identify headers, highlights, etc.
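As an illustration of what these requirements amount to, here is a sketch of a record that carries the required data, plus one position-based check. The `TextItem` class, the `page_region` function and the 10% margin are hypothetical and not taken from any of the libraries discussed below; the bounding box is assumed to have its y coordinates measured from the top of the page.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TextItem:
    """Hypothetical container for one extracted text item."""
    text: str
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), y from page top
    font_name: str
    font_size: float
    color: Optional[Tuple[float, float, float]] = None  # RGB in 0..1, if known


def page_region(item, page_height, margin_ratio=0.1):
    """Classify an item as header, footer or body by vertical position.

    margin_ratio is an assumed fraction of the page height reserved for
    the header and for the footer.
    """
    _, y0, _, y1 = item.bbox
    if y1 <= page_height * margin_ratio:
        return "header"
    if y0 >= page_height * (1 - margin_ratio):
        return "footer"
    return "body"
```

With page sizes and per-item bounding boxes available, checks for indentation or columns can be written in the same style on the x coordinates.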

PDFMiner library

The PDFMiner library was my first attempt, so most of the requirements are derived from its interface. The only things really missing there are font and background colours. Here is an example of how to dump the text content of a PDF file together with its position on the page and its format:

As you can see, with a very simple wrapper class around the PDFMiner library we can easily iterate over the text of a PDF document down to individual characters. Each level provides exact position information, and the lowest level also provides font information.

PDFMiner is very flexible: you can set a few parameters to control layout analysis and thus fine-tune text parsing to your needs. But, as already said, PDFMiner is quite slow, does not provide font colour information and does not support Python 3.

libpoppler with GObject Introspection interface

Poppler is a PDF rendering and parsing library based on the xpdf-3.0 code base. It is now hosted as part of freedesktop.org and is actively maintained. libpoppler is used in many open source PDF tools (Evince, Okular, GIMP, …) and provides rich functionality for both parsing and rendering.

The Poppler package contains a GObject Introspection binding, which means it can be used from scripting languages like Python or JavaScript. However, there is a significant issue with the introspection of one function, the one used for getting the positions of text on the page, so a workaround has to be used in this area.

Below is code that dumps PDF text in a similar way to the PDFMiner script:

Compared to the PDFMiner solution, it doesn’t provide aggregation at the text box or text line level, so lines have to be reconstructed from the page text. But it does provide the font colour (though not the background colour) for individual characters.

libpoppler with custom Python binding

Recently I learned more about Cython, so I decided to give it a try and create my own interface to libpoppler. It was actually not that difficult, and quite quickly I was able to create a Python binding that I can easily use as a replacement for PDFMiner in my project. The most notable difference is the reversed orientation of the y-axis (PDFMiner has 0 at the bottom of the page, here it is at the top). Below is sample code for dumping PDF text (with position and font information):

As you can see, this source code is the shortest, yet it still provides all the necessary data, including font colour (but not background colour). Apart from being significantly faster (see below for detailed benchmarks), it seems to be more reliable than PDFMiner (it solved a few issues where PDFMiner behaved strangely).
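The reversed y-axis matters when code written against PDFMiner coordinates is moved to the new binding. A minimal sketch of the conversion, assuming bounding boxes are plain (x0, y0, x1, y1) tuples with y0 <= y1 and that the page height is known; the `flip_bbox` name is made up for this example and is not part of either library:

```python
def flip_bbox(bbox, page_height):
    """Convert a bounding box between bottom-origin and top-origin y.

    The same function works in both directions: applying it twice
    returns the original box.
    """
    x0, y0, x1, y1 = bbox
    # The old upper edge becomes the new lower edge and vice versa,
    # so that y0 <= y1 still holds after the flip.
    return (x0, page_height - y1, x1, page_height - y0)
```

For example, on a page 842 points high, a box spanning y = 700 to 712 measured from the bottom spans y = 130 to 142 measured from the top.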

Benchmarks

I ran simple benchmarks of these three scripts. Time is measured with the Linux time utility as the mean of 3 runs (after an initial run, to exclude possible caching effects), and output is redirected to /dev/null to exclude terminal printing time. The benchmarks ran on my notebook with Ubuntu 14.04 64-bit, a Core i5 CPU @ 2.70GHz and 16GB of memory:

Document                      libpoppler with Cython   libpoppler with GI binding   PDFMiner
tiny document (half page)     0.033s                   0.098s                       0.121s
small document (5 pages)      0.141s                   0.499s                       0.810s
medium document (55 pages)    1.166s                   4.860s                       10.524s
large document (436 pages)    10.581s                  34.621s                      108.095s
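The numbers above were taken with the time utility. A similar procedure (one discarded warm-up run, then the mean of three runs with output thrown away) could be scripted roughly as follows. This is only a sketch: the `time_command` helper is hypothetical, it measures wall-clock time only, and the command shown is a stand-in rather than one of the benchmarked scripts.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Mean wall-clock seconds of cmd over `runs` runs, after one warm-up."""

    def run_once():
        start = time.perf_counter()
        # Discard stdout so terminal printing does not affect the timing
        subprocess.run(cmd, stdout=subprocess.DEVNULL, check=True)
        return time.perf_counter() - start

    run_once()  # warm-up run, excluded from the mean
    return statistics.mean([run_once() for _ in range(runs)])


if __name__ == "__main__":
    # Stand-in command; replace it with the script and PDF file to measure
    elapsed = time_command([sys.executable, "-c", "print('hello')"])
    print(f"mean of 3 runs: {elapsed:.3f}s")
```

Because this includes interpreter start-up in every run, very short runs such as the half-page document are dominated by fixed overhead rather than by parsing.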

As the table shows, the custom libpoppler binding with Cython is about 3 to 4 times faster than the GObject Introspection interface and, for large documents, about 10 times faster than PDFMiner.
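The ratios can be checked directly from the measured times. The snippet below only repeats the figures from the benchmark table; the `TIMINGS` and `speedups` names are introduced here for illustration.

```python
# Timings in seconds, copied from the benchmark table:
# (Cython binding, GI binding, PDFMiner)
TIMINGS = {
    "tiny (half page)": (0.033, 0.098, 0.121),
    "small (5 pages)": (0.141, 0.499, 0.810),
    "medium (55 pages)": (1.166, 4.860, 10.524),
    "large (436 pages)": (10.581, 34.621, 108.095),
}


def speedups(timings):
    """Return {document: (GI time / Cython time, PDFMiner time / Cython time)}."""
    return {
        name: (gi / cython, pdfminer / cython)
        for name, (cython, gi, pdfminer) in timings.items()
    }


for name, (vs_gi, vs_pdfminer) in speedups(TIMINGS).items():
    print(f"{name}: {vs_gi:.1f}x vs GI, {vs_pdfminer:.1f}x vs PDFMiner")
```

For the large document this gives roughly 3.3x and 10.2x. For the tiny one the advantage over PDFMiner shrinks to about 3.7x, presumably because fixed start-up cost dominates such short runs.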

Conclusions

The custom interface to libpoppler is the clear winner in all aspects: performance, code simplicity (a Pythonic interface) and reliability. I’ll be using it in my future projects.

My Python libpoppler binding is freely available on GitHub under the GPL v3 license (as libpoppler is GPL licensed too) and you are invited to try it. I’d like to hear your feedback.

12 thoughts on “Parsing PDF for Fun And Profit (indeed in Python)”

  1. Hey. I tried your interface and noticed a possible problem, and I also have some questions. The problem is with the page size. Sometimes, for some pages, the size is not correct. I think it’s rotated (width = real_height and height = real_width), but I might be wrong. I don’t really know how the PDF format works, but did you consider the orientation (landscape vs. portrait)?

    The questions are related to other parts of a PDF file, like lines and images. Are you going to complete this interface so that I can extract photos and the positions of other elements?

    1. Short answer is NO, for two reasons:
      a) I mostly use text
      b) it’s just a relatively simple binding to libpoppler that provides text extraction; anything else would be much more complicated.

      As for the incorrect page size, I take it directly from libpoppler, but if you have a clear example, send it to me and I’ll take a look.

  2. Hi, great work. I don’t have the poppler lib installed on my system. Do I need to import the compiled poppler lib into your Cython interface? How do I do that? Please help me! Thanks.

  3. Yes, I can do it. Now I need only the text information (no font, no color, only text), and I need some advice on making libpoppler.so.6x smaller for my needs (it is about 6.7MB right now).

    1. libpoppler is C++, so whatever works for C++ applies. I’m not sure what is statically linked there; try playing with its configure script. I’m just using it as it is.

      1. Right now I have many issues getting the PDF info (author, creation date, title, etc.). Can you help me, please? I looked into the poppler lib and I see how the pdfinfo plugin gets the info, but when I try to get it with Cython I get some errors.

  4. It’s great!!
    But I tested test_docs > test1.pdf.
    At “zastupitele” the font color changes, but the result is

    z (101.50, 361.01, 106.80, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    a (106.80, 361.01, 112.11, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    s (112.11, 361.01, 116.80, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    t (116.80, 361.01, 120.10, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    u (120.10, 361.01, 126.10, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    p (126.10, 361.01, 132.10, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    i (132.10, 361.01, 135.40, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    t (135.40, 361.01, 138.78, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    e (138.78, 361.01, 144.09, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    l (144.09, 361.01, 147.39, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00
    e (147.39, 361.01, 152.69, 374.29) LiberationSerif 12.0 r:1.00 g:0.40, b:0.00

    same RGB value.

      1. Color data are taken from libpoppler; generally I just pass on what I get. I have not worked with colors, so I’m not sure how accurately it works. You may file an issue on GitHub.

  5. Hi, I liked the look of your interface pdfparser-master. Will it run on Mac? I’ve struggled to install it there. Advice gratefully received.

    Kind regards,

    William

    1. Basically it should work on Mac (you’ll need gcc, Python and Cython there, which I believe should be available), but I have no experience with Mac and have not tested it there.
