PDF documents are ubiquitous in today’s world. Apart from the common use cases of printing and viewing, we sometimes need to do something specific with them, like converting them to other formats or extracting their textual content. Extracting text from a PDF document can be a (surprisingly) hard task due to the purpose and design of the format. PDF is intended to give an exact visual representation of a document’s pages, down to the smallest details, and the internal representation of the document’s text follows this goal. Rather than storing text in logical units (lines, paragraphs, columns, tables, ...), PDF represents it as a series of commands, each of which prints characters (a single character, a word, part of a line, ...) at an exact position on the page with a given font, font size, color, etc. To reconstruct the original logical structure of the text, a program has to scan all these commands and join together the pieces of text that probably formed the same line or the same paragraph. This task can be quite demanding and ambiguous: the relative position of text boxes can be interpreted in various ways (is this space between words so large because they are in different columns, or because the line is justified to both margins?).
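The line-joining step described above can be illustrated with a minimal sketch. It assumes the text fragments have already been extracted as positioned boxes; the Fragment type, the group_into_lines function and the y_tolerance value are hypothetical names chosen for this example and do not come from any particular PDF library. Real extractors use far more careful heuristics (font size, gaps between words, column detection).

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """A piece of text printed at a position on the page (hypothetical type)."""
    x: float    # left edge of the text box
    y: float    # vertical position (PDF y axis grows upwards)
    text: str


def group_into_lines(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Join fragments with nearly the same y position into lines of text."""
    lines = []  # each entry is [reference_y, [fragments on that line]]
    # Walk the page top to bottom: a larger y means higher on the page.
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    # Within a line, order fragments left to right and join them with spaces.
    # This naive join is exactly where the ambiguity mentioned above bites.
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


sample = [
    Fragment(60, 700.5, "world"),
    Fragment(10, 680, "Second"),
    Fragment(10, 700, "Hello"),
    Fragment(70, 680, "line"),
]
print(group_into_lines(sample))
```

The fragments are deliberately given out of order, since the drawing commands in a PDF do not have to follow reading order.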
So the task of text extraction looks quite discouraging. Luckily, some smart people have already tackled it and left us with libraries that do a pretty good job and that we can leverage. Some time ago I created a tool called PDF Checker, which does some analysis of PDF document content (presence or absence of some phrases, paragraph numbering, footer format, etc.). For it I used the excellent Python library PDFMiner. PDFMiner is a great tool and it is quite flexible, but being written entirely in Python it is rather slow. Recently I’ve been looking for alternatives that have Python bindings and provide functionality similar to PDFMiner. In this article I describe some results of this search, particularly my experiences with libpoppler. Continue reading Parsing PDF for Fun And Profit (indeed in Python)
I’ve been aware of Cython for a few years but never had a chance to really test it in practice (apart from a few dummy exercises). Recently I decided to look at it again and test it on my old project adecapcha. I was quite pleased with the results: I was able to speed up the program significantly with minimal changes to the code. Continue reading Cython Is As Good As Advertised
Email is still one of the most important means of electronic communication. Apart from everyday usage with a convenient client (like the superb Thunderbird), from time to time one might need to get message content out of a mailbox and perform some bulk action(s) with it. An example could be downloading all image attachments from your mailbox into a folder: this can easily be done manually for a few emails, but what if there are ten thousand of them? Your mailbox is usually hosted on a server and you can access it via the IMAP protocol. There are many possible ways to achieve this, but most of them require downloading or synchronizing the full mailbox locally and then extracting the required parts from the messages and processing them, which can be very inefficient. Recently I needed an automated task like the one above: search messages in a particular IMAP mailbox, identify attachments of a certain type and name, download them and run a command on them, and after the command finishes successfully, delete the email (or move it to another folder). Looking around, I did not find anything suitable that would meet my requirements (Linux, command line, simple yet powerful). So, having some experience with IMAP and Python, I decided to write such a tool myself. It’s called imap_detach, and you can check the details on its page. Here I’d like to present a couple of use cases for this tool in the hope that they might be useful for people with similar email processing needs.
Continue reading Download Email Attachments Automagically
From time to time one might need to write a simple language parser to implement a domain-specific language for an application. As always, the Python ecosystem offers various solutions; an overview of Python parser generators is available here. In this article I’d like to describe my experiences with the parsimonious package. For a recent project of mine (imap_detach, a tool to automatically download attachments from an IMAP mailbox) I needed simple expressions to specify which emails and which exact parts should be downloaded. Continue reading Writing Simple Parser in Python
Although there is a fair choice of GUI libraries for Python (a good overview of Python GUI libraries is here), sometimes we just need a slightly enhanced terminal interface. That was the case in my recent project, an XMPP test client, where the requirements were quite simple: split the terminal screen into two areas, a main screen where messages are displayed (possibly asynchronously) and a bottom line where commands/messages can be entered:
Continue reading Terminal Interfaces in Python
PaaS is happily buzzing in the Cloud and seems to be the hottest topic in infrastructure services today, so I decided to test OpenShift, the PaaS offering from Red Hat. A couple of reasons make this platform interesting. Firstly, it’s an open source solution, so we can use it to build our own private solution. Secondly, on the public service we get 3 gears (Linux containers with predefined configuration) for free forever, so it’s easy to experiment with the platform. As a sample project we will create a very simple Python Flask web application with MongoDB. Continue reading OpenShift Experiencies
As I’ve written, video files can be streamed via the BitTorrent protocol. Although responsiveness (time to start, time to seek) is notably worse than in specialized solutions, it is still usable for a normal user with a bit of patience.
Video files are also provided by file sharing servers, but in many cases the download rate is limited, so it’s not high enough to stream a video file. However, it’s often possible to open several requests for the same file and combine their download rates; this method is quite common in download managers. And if we add the possibility of streaming the downloaded content to a video player, we can achieve satisfactory results, possibly similar to or better than streaming via BitTorrent. Continue reading Video Streaming from File Sharing Servers
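One way to open several requests for the same file is to ask for a different byte range in each of them. The following is only a sketch of that idea, assuming the server honours HTTP Range requests; split_ranges and range_header are hypothetical helper names, not functions of any existing tool, and the actual downloading and reassembly are left out.

```python
def split_ranges(size, parts):
    """Split a file of `size` bytes into up to `parts` inclusive (start, end) ranges."""
    chunk, rest = divmod(size, parts)
    ranges = []
    start = 0
    for i in range(parts):
        # Spread the remainder over the first `rest` ranges.
        length = chunk + (1 if i < rest else 0)
        if length == 0:
            break  # more parts requested than bytes available
        ranges.append((start, start + length - 1))
        start += length
    return ranges


def range_header(start, end):
    """Build the HTTP header requesting bytes start..end (both inclusive)."""
    return {"Range": "bytes=%d-%d" % (start, end)}


for start, end in split_ranges(10, 3):
    print(range_header(start, end))
```

For streaming, the ranges would have to be fetched roughly in order, so that the beginning of the file reaches the player first.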
When working on btclient, I was interested in the possibility of downloading subtitles for the video file being played. This seems to be a common option in many players. I’ve found that opensubtitles.org provides an XML-RPC remote API, which is very easy to use. With the help of the Python
xmlrpclib module, it’s really a matter of minutes to create a simple working client. Continue reading OpenSubtitles provide easy to use API
In Python, a newly created subprocess inherits file descriptors from the parent process, and these descriptors are left open; at least this was the default until Python 3.3.
The subprocess.Popen constructor has a parameter
close_fds (which defaults to False on Python 2.7) that controls whether inherited FDs are closed or not. Leaving these FDs open in the child process can lead to many problems, as explained here and here. Continue reading Subtle evil of close_fds parameter in subprocess.Popen
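The effect of close_fds can be observed with a small experiment. This is a sketch for Python 3 on a POSIX system, where the defaults differ from Python 2.7: close_fds defaults to True there, and since Python 3.4 newly created descriptors are non-inheritable, so the example has to mark the descriptor inheritable explicitly to reproduce the old behaviour. The helper name child_sees_fd is made up for this illustration.

```python
import os
import subprocess
import sys

# Code run by the child: report whether the given descriptor number is open.
CHILD_CODE = """
import os, sys
fd = int(sys.argv[1])
try:
    os.fstat(fd)
    print("open")
except OSError:
    print("closed")
"""


def child_sees_fd(close_fds):
    """Spawn a child and return True if it inherited the parent's pipe descriptor."""
    read_fd, write_fd = os.pipe()
    # Opt in to inheritance to mimic the pre-3.4 behaviour discussed in the text.
    os.set_inheritable(write_fd, True)
    try:
        out = subprocess.check_output(
            [sys.executable, "-c", CHILD_CODE, str(write_fd)],
            close_fds=close_fds,
        )
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return out.decode().strip() == "open"


print("close_fds=False, child sees descriptor:", child_sees_fd(False))
print("close_fds=True,  child sees descriptor:", child_sees_fd(True))
```

An inherited write end of a pipe is a typical source of trouble: the reader never gets end-of-file as long as some child process still holds the descriptor open.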
It’s not obvious how to set up cx_Oracle correctly, so I’m putting some notes here:
Installation is described here.
- ORACLE_HOME is needed just for installation
- If you add the client library path to /etc/ld.so.conf.d/oracle.conf and run ldconfig, you don’t need to export a modified LD_LIBRARY_PATH
- Once you have installed the Oracle client library and set the environment, you can also install cx_Oracle via pip install cx_Oracle
The crucial step not mentioned in the installation guide is to set the NLS_LANG environment variable; it has to be in the environment of the Python program that uses cx_Oracle. So for instance for Flask+SQLAlchemy you can have:
Without this variable the Oracle client uses 7-bit ASCII! So any non-ASCII Unicode character will raise a “UnicodeEncodeError: ‘ascii’ codec can’t encode character” error.
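One possible way to make sure the variable is present is to set it from Python itself, before cx_Oracle is imported and before any SQLAlchemy engine is created. This is only a sketch: the helper name ensure_nls_lang and the value AMERICAN_AMERICA.AL32UTF8 (AL32UTF8 is Oracle’s name for UTF-8) are assumptions made for illustration, so pick the language and territory that fit your database.

```python
import os


def ensure_nls_lang(environ, value="AMERICAN_AMERICA.AL32UTF8"):
    """Set NLS_LANG in the given environment mapping unless it is already set.

    The default value is an assumption for this example.
    """
    environ.setdefault("NLS_LANG", value)
    return environ["NLS_LANG"]


# Call this at the very top of the application module (for Flask, e.g. the
# module that creates the app), before cx_Oracle is imported, because the
# Oracle client library reads the variable when it is initialised.
ensure_nls_lang(os.environ)
```

An existing value is left untouched, so a setting exported in the shell or in the web server configuration still takes precedence.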