PDF documents are ubiquitous in today’s world. Apart from the common use cases of printing and viewing, we sometimes need to do something more specific with them, like converting them to other formats or extracting their textual content. Extracting text from a PDF document can be a (surprisingly) hard task due to the purpose and design of the format. PDF is intended to represent the exact visual appearance of a document’s pages down to the smallest details, and the internal representation of document text follows this goal. Rather than storing text in logical units (lines, paragraphs, columns, tables, …), PDF represents it as a series of commands, each of which prints some characters (a single character, a word, a part of a line, …) at an exact position on the page with a given font, font size, color, etc. In order to reconstruct the original logical structure of the text, a program has to scan all these commands and join together the pieces of text that probably formed the same line or the same paragraph. This task can be pretty demanding and ambiguous: the relative positions of text boxes can be interpreted in various ways (is this space between words so large because they are in different columns, or because the line is justified to both ends?).
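To make the reconstruction problem a bit more concrete, here is a minimal sketch of the "join the pieces back into lines" step. It is not how PDFMiner or libpoppler actually do it; the `Fragment` class, the `group_into_lines` function and the `y_tolerance` and `space_gap` thresholds are all made up for illustration, and the coordinates are assumed to be in PDF points with y growing upwards.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text painted at a fixed position (hypothetical structure)."""
    x: float      # left edge of the text box
    y: float      # baseline; PDF y grows upwards from the bottom of the page
    width: float  # width of the text box
    text: str


def group_into_lines(fragments, y_tolerance=2.0, space_gap=1.5):
    """Join positioned fragments into lines of text, top to bottom.

    y_tolerance and space_gap are assumed, hand-picked thresholds.
    """
    lines = []  # each entry is a list of fragments sharing roughly one baseline
    # Higher y means higher on the page, so sort by descending y.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)  # left to right within a line
        parts = [line[0].text]
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A horizontal gap wider than the threshold is taken as a word break.
            if gap > space_gap:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return result


fragments = [
    Fragment(72, 700, 25, "Hello"),
    Fragment(100, 700.5, 30, "world"),
    Fragment(72, 686, 20, "PDF"),
    Fragment(92.5, 686, 10, "s"),
]
print(group_into_lines(fragments))  # prints ['Hello world', 'PDFs']
```

Note that a fixed `space_gap` is exactly where the ambiguity bites: the same threshold cannot tell a column gutter from a stretched space in a justified line, which is why real libraries use more elaborate layout analysis.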
So the task of text extraction looks quite discouraging. Luckily, some smart guys have already tackled it and left us with libraries that do a pretty good job, and we can leverage them. Some time ago I created a tool called PDF Checker, which does some analysis of PDF document content (presence or absence of certain phrases, paragraph numbering, footer format, etc.). For it I used the excellent Python library PDFMiner. PDFMiner is a great tool and quite flexible, but being written entirely in Python, it is rather slow. Recently I’ve been looking for alternatives that have Python bindings and provide functionality similar to PDFMiner. In this article I describe some results of this search, particularly my experiences with libpoppler.