PDF Checker

Organizations or teams that manually produce many PDF documents may want to apply some quality checks before releasing them. PDF Checker applies basic automatic checks, including searching for the presence (or absence) of text, page numbering checks, heading and paragraph numbering checks, and more. The application is written in Python and provides both a CLI and a web interface.

How does it work?

The application uses the pdfminer library to extract text from PDF documents. The text is split into individual lines (as recognized by pdfminer, which reconstructs lines from the positions of characters on the page). Each line is supplemented with its page number, its position on the page (top left point and bounding box of the line), and information about the font used (name, variant, size).

Lines are then fed into checking algorithms (implemented as plugins), which look for issues (such as an unwanted word). Issues are collected and then displayed together with their positions in the document.
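The flow described above can be sketched roughly as follows. This is only an illustration: Line, Results, UnwantedWordCheck and run_checks are hypothetical stand-ins written for this example, not the application's actual classes, and the real program may organize the loop differently.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for one extracted line of text."""
    text: str
    page_no: int


class Results:
    """Hypothetical stand-in for the collection that gathers issues."""

    def __init__(self):
        self.problems = []

    def add(self, description, line):
        self.problems.append((description, line.page_no, line.text))


class UnwantedWordCheck:
    """Toy check: reports every line that contains an unwanted word."""

    def __init__(self, word):
        self.word = word
        self.results = Results()

    def feed(self, line):
        if self.word in line.text:
            self.results.add("Found unwanted word %r" % self.word, line)


def run_checks(lines, checks):
    """Feed every line to every check, then gather all reported issues."""
    for line in lines:
        for check in checks:
            check.feed(line)
    return [p for check in checks for p in check.results.problems]


lines = [Line("Final report", 1), Line("TODO: add summary", 2)]
issues = run_checks(lines, [UnwantedWordCheck("TODO")])
for description, page_no, text in issues:
    print("page %d: %s (%s)" % (page_no, description, text))
```

Each check keeps its own results, so checks stay independent of one another and the caller only has to merge what they report.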

Features

  • User-friendly web interface with integrated PDF viewer
  • One-click error highlighting: click on an error and the PDF viewer highlights it for you
  • CLI for use from scripts
  • Flexible: new checks can easily be added as plugins (see below)
  • Easy to deploy

Screenshots

Writing a Plugin

Implementing a new check is easy. Create a new module in the plugins directory. Here is a sample dummy plugin (it looks for the word ‘Dummy’ in the document):

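import re
from common import CheckStrategy, Problem

class DummyCheck(CheckStrategy):
    name = "Sample Dummy Check"      # unique name of the check (required)
    help = "Checks plugin tutorial"  # optional description

    def __init__(self):
        super(DummyCheck, self).__init__()

    def feed(self, line):
        # Called for each line of text retrieved from the PDF document
        for m in re.finditer('Dummy', line.text):
            # Bounding box of the matched substring only, not the whole line
            bbox = line.get_bbox(m.start(), m.end())
            p = Problem('Found Dummy', line)
            p.bbox = bbox
            self.results.add_problem(p)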

A plugin is a class that extends the CheckStrategy base class. A new plugin must have a class property name, containing a unique name for your check. Optionally, it can also have help and optional properties.

If you override the __init__ method, do not forget to call the superclass's __init__.

The check itself is implemented by overriding the feed method. This method receives each line of text as it is retrieved from the PDF document. The line parameter is of type TextLine; its text property is a string containing the text of the line. TextLine has other useful properties, such as page_no (page number), top and left (relative position of the beginning of the line, in percent), and bbox (bounding box of the text line in page coordinates). It also has some useful methods, such as get_bbox (which returns the bounding box of a substring of the line) and size_at and font_at (which return the font size or font name of the character at a given index).
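To make the idea of a substring bounding box more concrete, here is a minimal sketch of how such a lookup could work. It assumes the line stores one box per character and takes their union. SketchTextLine and its char_boxes attribute are hypothetical and written only for this example; the real TextLine may be implemented quite differently.

```python
class SketchTextLine:
    """Hypothetical model of a text line; NOT the application's TextLine."""

    def __init__(self, text, char_boxes, page_no):
        # Assumption: char_boxes[i] is the (x0, y0, x1, y1) box of text[i]
        if len(text) != len(char_boxes):
            raise ValueError("need exactly one box per character")
        self.text = text
        self.char_boxes = char_boxes
        self.page_no = page_no

    def get_bbox(self, start, end):
        """Box of text[start:end], computed as the union of character boxes."""
        boxes = self.char_boxes[start:end]
        if not boxes:
            raise ValueError("empty substring has no bounding box")
        return (min(b[0] for b in boxes), min(b[1] for b in boxes),
                max(b[2] for b in boxes), max(b[3] for b in boxes))

    @property
    def bbox(self):
        """Bounding box of the whole line."""
        return self.get_bbox(0, len(self.text))


# Toy data: every character is 10 units wide and 12 units high
text = "A Dummy line"
boxes = [(50 + 10 * i, 700, 60 + 10 * i, 712) for i in range(len(text))]
line = SketchTextLine(text, boxes, page_no=1)

start = text.index("Dummy")
word_box = line.get_bbox(start, start + len("Dummy"))
print(word_box)   # (70, 700, 120, 712)
print(line.bbox)  # (50, 700, 170, 712)
```

The start and end arguments follow the same half-open convention as m.start() and m.end() of a regular expression match, which is how the dummy plugin calls get_bbox.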

If a problem is found, it should be added to the self.results collection, either via the add_problem method (which requires a Problem instance) or via the add method (which takes a description of the problem and the line as its parameters). If results cannot be calculated on the fly, you can collect information in the feed method and then override the prepare_results method, which is called at the end, once the whole document has been processed.
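A check for missing text is a typical case where the result is only known at the end. The sketch below shows the pattern. So that it runs on its own, CheckStrategy, Results and Line are minimal hypothetical stand-ins defined here, not the real classes from the common module, and RequiredTextCheck is an invented example. Reporting the problem against the first line of the document is also just an assumption made for illustration.

```python
class Results:
    """Hypothetical stand-in for self.results."""

    def __init__(self):
        self.problems = []

    def add(self, description, line):
        self.problems.append((description, line))


class CheckStrategy:
    """Hypothetical stand-in for the real base class."""

    def __init__(self):
        self.results = Results()

    def prepare_results(self):
        pass  # default: nothing to do at the end


class Line:
    """Hypothetical stand-in for TextLine."""

    def __init__(self, text, page_no):
        self.text = text
        self.page_no = page_no


class RequiredTextCheck(CheckStrategy):
    name = "Required Text Check"
    help = "Reports when a required phrase never appears"

    def __init__(self, phrase):
        super(RequiredTextCheck, self).__init__()
        self.phrase = phrase
        self.found = False
        self.first_line = None

    def feed(self, line):
        # Only collect information here; nothing can be reported yet
        if self.first_line is None:
            self.first_line = line
        if self.phrase in line.text:
            self.found = True

    def prepare_results(self):
        # Called once after the last line, so the verdict is now known
        if not self.found and self.first_line is not None:
            self.results.add("Missing required text: %s" % self.phrase,
                             self.first_line)


lines = [Line("Annual report", 1), Line("Summary", 2)]
check = RequiredTextCheck("Confidential")
for line in lines:
    check.feed(line)
check.prepare_results()
print([description for description, _ in check.results.problems])
```

In this sketch, a document with no lines at all produces no report, because there is no line to attach the problem to.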

That is all: check plugins are automatically loaded and used by the program. However, sometimes you may need to initialize a check with some parameters before using it. In this case you can supply a create_instance function in the plugin module, which returns an appropriately initialized instance of your check:

def create_instance():
    i = ExistingCheckClass('some init params')
    i.change_name('New Name')
    i.change_help('New help')
    return i

You can also change the name and help of the check when reusing an existing check with different parameters.
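As a more concrete illustration, a parameterized check could be reused like this. It is a sketch under assumptions: the CheckStrategy class below is a minimal hypothetical stand-in (the real change_name and change_help may work differently), and WordCheck is an invented example rather than one of the bundled checks.

```python
import re


class CheckStrategy:
    """Hypothetical stand-in for the real base class."""
    name = None
    help = None

    def __init__(self):
        self.found = []  # simplified replacement for self.results

    def change_name(self, name):
        # Assumption: overrides the name for this instance only
        self.name = name

    def change_help(self, help_text):
        self.help = help_text


class WordCheck(CheckStrategy):
    name = "Word Check"
    help = "Looks for a given word"

    def __init__(self, word):
        super(WordCheck, self).__init__()
        # Match the word as a whole word only
        self.pattern = re.compile(r"\b%s\b" % re.escape(word))

    def feed(self, text):
        # Simplified: takes a plain string instead of a TextLine
        if self.pattern.search(text):
            self.found.append(text)


def create_instance():
    # Reuse WordCheck with its own parameter, name and help
    i = WordCheck("draft")
    i.change_name("Draft Marker Check")
    i.change_help("Reports leftover 'draft' markers")
    return i


check = create_instance()
check.feed("This is a draft version")
check.feed("Redrafted twice")  # no match: 'draft' is not a whole word here
print(check.name, check.found)
```

Because the name is changed on the instance, the same class can back several checks that differ only in their parameters, each with its own unique name.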

Download and install

The code is available under the GPL v3 license.

The code can be downloaded from GitHub. You'll need to write your own checking plugins to make it useful, because the included ones are quite specific to my use case.

For installation instructions, see the README file.

 

 
