Organizations or teams that manually produce many PDF documents may want to run some quality checks before releasing them. PDF checker applies basic automatic checks, including searching for the presence (or absence) of text, page numbering checks, heading and paragraph numbering checks, and more. The application is written in Python and provides both a CLI and a web interface.
How does it work?
The application uses the pdfminer library to extract text from PDF documents. The text is split into individual lines (as recognized by pdfminer, which reconstructs lines from the positions of characters on the page), and each line is supplemented with its page number, its position on the page (top-left point and bounding box of the line), and information about the font used (name, variant, size).
The lines are then fed into checking algorithms (implemented as plugins), which look for issues (such as an unwanted word). The issues are collected and then displayed together with their position in the document.
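The flow described above (lines in, issues out) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the application's code: Line, Issue, UnwantedWordCheck and run_checks are hypothetical names invented for the sketch, and the lines are typed in by hand instead of being extracted with pdfminer.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for one extracted line of text."""
    text: str
    page_no: int


@dataclass
class Issue:
    """Hypothetical record of one problem and where it was found."""
    description: str
    page_no: int
    text: str


class UnwantedWordCheck:
    """Hypothetical check: reports every line containing a given word."""

    def __init__(self, word):
        self.word = word
        self.issues = []

    def feed(self, line):
        if self.word in line.text:
            self.issues.append(Issue("Found " + self.word, line.page_no, line.text))


def run_checks(lines, checks):
    """Feed every line to every check, then gather all issues in one list."""
    for line in lines:
        for check in checks:
            check.feed(line)
    return [issue for check in checks for issue in check.issues]


lines = [
    Line("Introduction", 1),
    Line("TODO: describe results", 2),
    Line("Summary", 3),
]
issues = run_checks(lines, [UnwantedWordCheck("TODO")])
for issue in issues:
    print(f"page {issue.page_no}: {issue.description}")
```

Each check only ever sees one line at a time, which is what keeps individual checks small and easy to write.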
- User-friendly web interface with an integrated PDF viewer
- One-click error highlighting: click on an error and the PDF viewer highlights it for you
- CLI for use from scripts
- Flexible: new checks can be added very easily as plugins (see below)
- Easy to deploy
Implementing a new check is very easy. Create a new module in the plugins directory. Here is a sample dummy plugin (it looks for the word ‘Dummy’ in the document):
import re

from common import CheckStrategy, Problem


class DummyCheck(CheckStrategy):
    name = "Sample Dummy Check"
    help = "Checks plugin tutorial"

    def __init__(self):
        super(DummyCheck, self).__init__()

    def feed(self, line):
        # Report every occurrence of 'Dummy' in the line
        for m in re.finditer('Dummy', line.text):
            bbox = line.get_bbox(m.start(), m.end())
            p = Problem('Found Dummy', line)
            p.bbox = bbox
            self.results.add_problem(p)
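A plugin is a class extending the CheckStrategy base class. A new plugin must have a class property name, holding the unique name of your check. Optionally it can have a class property help, with a short description of the check (as in the sample above).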
If you override the __init__ method, do not forget to call the superclass’s __init__.
The check itself is implemented by overriding the feed method. This method receives a line of text as it is retrieved from the PDF document. The parameter line is of type TextLine; it has a property text, which is a string containing the text of the line.
TextLine has other useful properties, such as page_no (page number), left (relative position of the beginning of the line, in percent) and bbox (bounding box of the text line in page coordinates). It also has some useful methods, such as get_bbox (which gets the bounding box for a substring of the line) and size_at or font_at (which get the font size or font name for the character at a given index).
If a problem is found, it should be added to the self.results collection, either via the add_problem method (which requires a Problem instance as input) or via the add method (which takes a description of the problem and the line as its parameters). If results cannot be calculated on the fly, you can collect information in the feed method and override the prepare_results method, which is called at the end, once the whole document has been processed.
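As an illustration of a check that cannot decide anything line by line, here is a minimal sketch that collects lines in feed and only reports in prepare_results. To keep it runnable outside the application, CheckStrategy, Results and TextLine below are simplified stand-ins written for this example, not the real classes from common; in particular, the no-argument form of prepare_results and the way results are stored are assumptions. RepeatedLineCheck itself is a made-up check.

```python
from collections import defaultdict


# --- Stand-ins (assumptions) so the sketch runs on its own. ---
# In a real plugin you would use: from common import CheckStrategy
class Results:
    def __init__(self):
        self.problems = []

    def add(self, description, line):
        # Mirrors the described add(description, line) call.
        self.problems.append((description, line))


class CheckStrategy:
    def __init__(self):
        self.results = Results()

    def feed(self, line):
        pass

    def prepare_results(self):
        pass


class TextLine:
    def __init__(self, text, page_no):
        self.text = text
        self.page_no = page_no


# --- The hypothetical check itself ---
class RepeatedLineCheck(CheckStrategy):
    name = "Repeated Line Check"
    help = "Reports lines whose text occurs more than once in the document"

    def __init__(self):
        super().__init__()
        self.seen = defaultdict(list)

    def feed(self, line):
        # Only collect here; nothing can be decided until the whole
        # document has been read.
        self.seen[line.text.strip()].append(line)

    def prepare_results(self):
        # Called once at the end: report every occurrence after the first.
        for text, lines in self.seen.items():
            if text and len(lines) > 1:
                for line in lines[1:]:
                    self.results.add("Repeated line: " + text, line)


check = RepeatedLineCheck()
for line in [TextLine("1 Scope", 1), TextLine("Details", 1), TextLine("1 Scope", 4)]:
    check.feed(line)
check.prepare_results()

for description, line in check.results.problems:
    print(f"page {line.page_no}: {description}")
```

The same pattern (collect in feed, decide in prepare_results) would fit checks such as page or heading numbering, where a single line says nothing without its neighbours.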
That is all: check plugins are automatically loaded and used by the program. However, sometimes you might need to initialize a check with some parameters before using it. In this case you can supply a create_instance function in the plugin module, which returns an appropriately initialized instance of your check:
def create_instance():
    i = ExistingCheckClass('some init params')
    i.change_name('New Name')
    i.change_help('New help')
    return i
You can change the name and help of the check when reusing an existing check with different parameters.
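How the application actually discovers and instantiates plugins is not covered here, but the role of create_instance can be shown with a small sketch of one possible loader. Everything in it is an assumption made for illustration: load_check, the stand-in CheckStrategy, the WordCheck class and the two in-memory "plugin modules" are hypothetical and are not taken from the project.

```python
import inspect
import types


class CheckStrategy:
    """Stand-in for the application's base class (assumption)."""
    name = None


def load_check(module):
    """Return a check instance from a plugin module.

    Prefers the module's create_instance() factory; otherwise instantiates
    the first CheckStrategy subclass found in the module with no arguments.
    """
    factory = getattr(module, "create_instance", None)
    if callable(factory):
        return factory()
    for _, obj in inspect.getmembers(module, inspect.isclass):
        if issubclass(obj, CheckStrategy) and obj is not CheckStrategy:
            return obj()
    return None


class WordCheck(CheckStrategy):
    """Hypothetical parameterized check."""
    name = "Word Check"

    def __init__(self, word="Dummy"):
        self.word = word


# Two in-memory "plugin modules" built only for this demonstration.
plain = types.ModuleType("plain_plugin")
plain.WordCheck = WordCheck

configured = types.ModuleType("configured_plugin")
configured.WordCheck = WordCheck
configured.create_instance = lambda: WordCheck("TODO")

print(load_check(plain).word)       # default construction
print(load_check(configured).word)  # construction via create_instance
```

A factory function like this lets one check class be reused in several plugin modules, each with its own parameters, without the loader having to know anything about them.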
Download and install
The code is available under the GPL v3 license.
The code can be downloaded from GitHub. You’ll need to write your own checking plugins to make it useful, because the included ones are quite specific to my case.
For installation instructions, see the README file.