Tag Archives: python

Run and monitor tasks via WebSocket with ASEXOR

Many modern web applications require more than just displaying data in the browser. Data may need to be processed and transformed in various ways, which requires intensive processing tasks on the server side. Such processing is best done asynchronously, outside of the web application server, as such tasks can be relatively long running. There are already many existing solutions for asynchronous task scheduling: some of them are quite sophisticated general frameworks like Celery or Kafka, others are built-in features of application servers (like mules and spoolers in uWSGI). But what if we need something simpler, which can work directly with JavaScript clients and is super simple to use in a project? Meet asexor (ASynchronous EXecutOR), a small project of mine. Continue reading Run and monitor tasks via WebSocket with ASEXOR

Comparison of JSON Like Serializations – JSON vs UBJSON vs MessagePack vs CBOR

Recently I’ve been working on some extensions to ASEXOR, adding direct support for messaging via WebSocket, and I use JSON for the small messages that travel between the client (browser or standalone) and the backend. Messages look like these:

I wondered if choosing a different serialization format (similar to JSON, but binary) could bring more efficiency into the application, considering both message size and encoding/decoding processing time. I ran small tests in Python (see tests here on gist) with a few established serializers, which can be used as a quick replacement for JSON, and below are the results: Continue reading Comparison of JSON Like Serializations – JSON vs UBJSON vs MessagePack vs CBOR
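A comparison of this kind can be sketched roughly as follows. This is only an illustrative harness, not the test code from the gist: the sample message and its field names are made up, and the binary codecs (`msgpack`, `cbor2`) are optional third-party packages that are simply skipped if they are not installed.

```python
import json
import timeit


def benchmark(name, dumps, loads, message, repeat=1000):
    """Measure encoded size and encode/decode time for one serializer."""
    encoded = dumps(message)
    # Sanity check: the codec must round-trip the message unchanged.
    assert loads(encoded) == message
    encode_s = timeit.timeit(lambda: dumps(message), number=repeat)
    decode_s = timeit.timeit(lambda: loads(encoded), number=repeat)
    return {"name": name, "size": len(encoded),
            "encode_s": encode_s, "decode_s": decode_s}


# Hypothetical message, only for illustration (not the real ASEXOR format).
message = {"call_id": 12345, "task": "convert",
           "args": ["input.txt", "output.txt"], "kwargs": {"quality": 5}}

# JSON is always available; encode to bytes so sizes are comparable.
codecs = [("json",
           lambda m: json.dumps(m).encode("utf-8"),
           lambda b: json.loads(b.decode("utf-8")))]

try:
    import msgpack  # third-party, optional
    codecs.append(("msgpack",
                   lambda m: msgpack.packb(m, use_bin_type=True),
                   lambda b: msgpack.unpackb(b, raw=False)))
except ImportError:
    pass

try:
    import cbor2  # third-party, optional
    codecs.append(("cbor", cbor2.dumps, cbor2.loads))
except ImportError:
    pass

results = [benchmark(name, dumps, loads, message)
           for name, dumps, loads in codecs]

for row in results:
    print("{name:8s} {size:4d} bytes  enc {encode_s:.4f}s  dec {decode_s:.4f}s"
          .format(**row))
```

Other serializers can be added to the `codecs` list in the same way, as long as they expose a pair of encode/decode functions.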

Asyncio Proxy for Blocking Functions

File operations and other I/O operations can block the asyncio loop, and unfortunately Python does not support true asynchronous disk operations. This is mainly due to the problematic state of async disk I/O in the underlying OS (Linux in this case): a special library is needed for truly asynchronous disk operations, so normally select (or another I/O event library) always reports a file as ready to read and write, and thus file I/O operations block. The current solution is to run such operations in a thread pool executor. There is an asyncio wrapper library for file objects, aiofiles, but there are also many blocking functions in other Python modules, like os, shutil, etc. We can easily write wrappers for such functions, but it can be annoying and time consuming if we use many of them. What about writing a generic proxy, which ensures that methods are executed in a thread pool, and using this proxy for all potentially blocking methods within the module? Continue reading Asyncio Proxy for Blocking Functions
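The idea of such a proxy could look roughly like the sketch below. It is a minimal illustration written with `async`/`await`, not necessarily the implementation from the full article; the class name `AsyncProxy` and the variable `aos` are made up for this example.

```python
import asyncio
import functools
import os


class AsyncProxy:
    """Wrap a module (or any object) so its callables run in a thread pool."""

    def __init__(self, target, executor=None):
        self._target = target
        # None means "use the event loop's default executor".
        self._executor = executor

    def __getattr__(self, name):
        attr = getattr(self._target, name)
        if not callable(attr):
            # Constants, submodules etc. are passed through unchanged.
            return attr

        @functools.wraps(attr)
        async def wrapper(*args, **kwargs):
            loop = asyncio.get_running_loop()
            # run_in_executor takes only positional arguments,
            # so bind everything with functools.partial first.
            call = functools.partial(attr, *args, **kwargs)
            return await loop.run_in_executor(self._executor, call)

        return wrapper


# Hypothetical usage: an "async os" whose functions do not block the loop.
aos = AsyncProxy(os)


async def main():
    names = await aos.listdir(".")
    return names


listing = asyncio.run(main())
```

Each call still occupies a worker thread for its duration, so this hides blocking calls from the event loop rather than making the disk I/O itself asynchronous.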

Revival of Neural Networks

My actual master’s studies topic was AI (more than 20 years ago). Artificial Neural Networks (ANNs) were already a known and popular branch of AI, and we had some introduction to the basics of artificial neural networks (like the perceptron, back propagation, etc.). Though it was quite an interesting topic, I did not see many practical applications in those days. Recently I chatted with an old friend of mine, who stayed at the university and is involved in computer vision research, about recent advancements in AI and computer vision, and he told me that the most notable change in recent years was that neural networks are now being used on a large scale. Mainly due to the increase in computing power, neural networks can now be applied to many real world problems. Another hint about the popularity of neural networks came from my former boss, who shared with me this interesting article about privacy issues related to machine learning. I’ve been looking around for a while and it looks like neural networks have become quite popular recently, especially architectures with many layers used in so called deep learning. In this article I’d like to share my initial experiences with TensorFlow, an open source library (created by Google), which can be used to build modern, multi-layered neural networks. Continue reading Revival of Neural Networks

WAMP Is WebSocket on Steroids

If you look up the WAMP abbreviation on the Internet, you will probably find that WAMP = Windows + Apache + MySQL + PHP, which was a popular web stack some time ago (but who wants to run a web server on Windows today? And the other components now also have viable alternatives). But in this article I’d like to talk about WAMP = Web Application Messaging Protocol. WAMP is available as a WebSocket subprotocol, but can also work over plain TCP or Unix domain sockets. Continue reading WAMP Is WebSocket on Steroids

Run Multiple Terminal Tabs with Python Virtualenv

Virtualenv is a must-have for Python development. If your project is a complex beast consisting of multiple services/components, you want to see them running in different terminals (ideally tabs of one terminal window). Starting all terminals manually could be cumbersome. This simple script starts terminal tabs (in gnome-terminal) with activated virtual environments and, optionally, the appropriate services/applications started:

Note the -ci argument: an interactive shell must be enforced to run the command with the virtual environment loaded.
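A minimal sketch of such a launcher, written here in Python rather than as a shell script, might look like this. The paths, service commands and helper names are hypothetical, and it assumes a gnome-terminal version that still accepts `--tab` together with the `-e` option (which newer releases mark as deprecated).

```python
import shlex
import subprocess


def tab_command(venv_dir, workdir, command=None):
    """Build the shell command line executed inside one terminal tab."""
    steps = [
        "cd " + shlex.quote(workdir),
        "source " + shlex.quote(venv_dir + "/bin/activate"),
    ]
    if command:
        steps.append(command)
    # -c runs the given string, -i forces an interactive shell.
    return "bash -ci " + shlex.quote("; ".join(steps))


def build_args(tabs):
    """Turn a list of (venv_dir, workdir, command) into gnome-terminal args."""
    args = ["gnome-terminal"]
    for venv_dir, workdir, command in tabs:
        args += ["--tab", "-e", tab_command(venv_dir, workdir, command)]
    return args


# Hypothetical project layout, for illustration only.
tabs = [
    ("/srv/venv", "/srv/app/api", "python server.py"),
    ("/srv/venv", "/srv/app/worker", "python worker.py"),
]
args = build_args(tabs)

if __name__ == "__main__":
    print(" ".join(shlex.quote(a) for a in args))
    # Uncomment to actually open the tabs (requires gnome-terminal):
    # subprocess.Popen(args)
```

In this sketch a tab closes as soon as its command exits; keeping the tab open afterwards would need an extra step that is not shown here.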

Also, gnome-terminal recently dropped support for the --title parameter, which made it possible to set the title of a tab (I really do not understand why, because it was very useful). So now our tabs will have the same prompt.

This can be fixed to some extent by modifying the virtualenv activate script to include the terminal escape sequence shown below (thus we will see the current terminal directory as the tab title):

 

Functional Fun with Asyncio and Monads

Python 3.4+ provides the excellent asyncio library for asynchronous task scheduling and asynchronous I/O operations. It’s similar to gevent, but here tasks are implemented by generator based coroutines. Asynchronous I/O is useful for higher I/O loads, where it usually achieves better performance and scalability than other approaches (threads, processes). About a year ago I played with OCaml, where lightweight threads/coroutines and asynchronous I/O approaches are also very popular (OCaml has the same limitation for threading as Python, a global lock), and there were two great libraries: lwt and core async. Both libraries use monads as a programming style to work with asynchronous tasks. In this article we will try to implement something similar on the basis of the asyncio library. While our solution will probably not provide “pure” monads, it’ll still be fun and we’ll learn something about asyncio. Continue reading Functional Fun with Asyncio and Monads
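To give a flavour of the monadic style, here is a minimal sketch of the two basic operations, `unit` and `bind`, on top of asyncio. It is only an illustration under a few assumptions: it uses the newer `async`/`await` syntax (Python 3.5+) instead of generator based coroutines, and the names `unit`, `bind`, `add_one` and `double` are chosen for this example.

```python
import asyncio


async def unit(value):
    """Lift a plain value into an awaitable (monadic 'return')."""
    return value


async def bind(awaitable, fn):
    """Wait for the result, then feed it to fn, which returns an awaitable."""
    value = await awaitable
    return await fn(value)


async def add_one(x):
    await asyncio.sleep(0)  # stand-in for some real asynchronous work
    return x + 1


async def double(x):
    await asyncio.sleep(0)
    return x * 2


async def main():
    # Chain the steps: unit(3) >>= add_one >>= double
    return await bind(bind(unit(3), add_one), double)


result = asyncio.run(main())
print(result)  # (3 + 1) * 2 = 8
```

Each step only describes what to do with the previous result, so the chain reads much like the `>>=` pipelines used with lwt in OCaml.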

SQL or NoSQL – Why not to use both (in PostgreSQL)

NoSQL databases have become very popular in recent years and there are plenty of options available. It looks like traditional relational databases (RDBMS) are almost not needed any more. NoSQL solutions are advertised as faster, more scalable and easier to use. So who would care about relations, joins, foreign keys and similar stuff (not to mention ACID properties, transactions, transaction isolation)? Who would, if NoSQL databases can make your life much easier? But there is a key insight about NoSQL databases: their wonderful achievements are possible because they made their own life easier too in some aspects. And that comes at a price. Would you be happy if your bank stored your savings in MongoDB?

However, there are many environments where NoSQL databases shine, especially when there are huge amounts of simple data structures, which need to be scaled massively across the globe, and where the data are not of much value. Solutions like social networks, instant messaging, etc. are not so much concerned about data consistency or data loss, because these data are basically valueless. (Their business model is just based on sharing absolutely trivial data, where one piece can be easily replaced with another and it does not matter if some pieces are lost. Consider: what would happen if the whole of Facebook went away in one minute? Nothing! A few people would be pissed off because they think their online profile was cool, a few sad that they cannot share their meaningless achievements with so called ‘friends’, but generally nothing special would happen and no real value would be lost. People would just switch to another provider, fill its database with tons of trivialities and easily forget about the data in their previous account.)

I don’t want to create the impression that NoSQL databases are useless. They are very good for certain scenarios (and we need to remember that NoSQL is a rather broad category: it includes structured document stores, key-value stores, object databases, etc., and each one has its particular niche where it excels), but relational databases are also good, actually very good. The relational model is a fairly good abstraction of very many real world situations, data structures, entities, however we call them. And relational databases provide solid tools to work with them. So it makes sense to use them in many cases. It might be a bit more difficult to start with a relational database than with a schema-less document store, but in the long run it should pay off. And what is really nice is that it’s not about one solution or the other: we can use both and combine them smartly and inventively.
So enough of general mumbo jumbo, let’s get to my particular case. I’ve been looking for a data store for my new project and considered trying MongoDB this time (while in the past I stuck to relational DBs), however I finally decided on PostgreSQL (again), and I’d like to share some tests, findings and thoughts. Continue reading SQL or NoSQL – Why not to use both (in PostgreSQL)

Parsing PDF for Fun And Profit (indeed in Python)

PDF documents are ubiquitous in today’s world. Apart from the common use cases of printing, viewing, etc., we sometimes need to do something specific with them, like converting them to other formats or extracting textual content. Extracting text from a PDF document can be a (surprisingly) hard task due to the purpose and design of PDF documents. PDF is intended to give an exact visual representation of the document’s pages down to the smallest details, and the internal representation of the document text follows this goal. Rather than storing text in some logical units (lines, paragraphs, columns, tables …), text is represented as a series of commands, which print characters (a single character, a word, part of a line, …) at an exact position on the page with a given font, font size, color, etc. In order to reconstruct the original logical structure of the text, a program has to scan all these commands and join together the texts which probably formed the same line or the same paragraph. This task can be pretty demanding and ambiguous, because the mutual position of text boxes can be interpreted in various ways (is this space between words too large because they are in different columns, or because the line is justified to both ends?).
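The joining step can be illustrated with a deliberately naive sketch. It is not how PDFMiner or poppler actually work; the `Fragment` structure, the tolerance values and the sample data are all assumptions made for this example, and it ignores columns, rotated text and font metrics entirely.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One piece of text as placed on the page (hypothetical structure)."""
    x: float      # left edge of the fragment
    y: float      # baseline; PDF coordinates grow upwards
    width: float  # rendered width of the fragment
    text: str


def join_lines(fragments: List[Fragment], y_tol: float = 2.0,
               gap: float = 1.0) -> List[str]:
    """Group fragments with similar baselines into lines, top to bottom."""
    lines = []  # list of (reference_y, [fragments on that line])
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0] - frag.y) <= y_tol:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)  # left to right within the line
        parts = [frags[0].text]
        for prev, cur in zip(frags, frags[1:]):
            # Insert a space only if there is a visible horizontal gap.
            if cur.x - (prev.x + prev.width) > gap:
                parts.append(" ")
            parts.append(cur.text)
        result.append("".join(parts))
    return result


sample = [
    Fragment(45, 700.5, 25, "world"),
    Fragment(10, 700, 20, "Hel"),
    Fragment(30, 700, 10, "lo"),
    Fragment(10, 680, 30, "Second"),
]
print(join_lines(sample))  # ['Hello world', 'Second']
```

The ambiguity mentioned above shows up directly in the two thresholds: a `gap` that works for one document will glue words together or split them apart in another.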

So the task of text extraction looks quite discouraging. Luckily, some smart guys have tried it already and left us with libraries that do a pretty good job, and we can leverage them. Some time ago I created a tool called PDF Checker, which does some analysis of PDF document content (presence or absence of some phrases, paragraph numbering, footer format, etc.). I used the excellent Python PDFMiner library there. PDFMiner is a great tool and it is quite flexible, but being written entirely in Python it’s rather slow. Recently I’ve been looking for some alternatives which have Python bindings and provide functionality similar to PDFMiner. In this article I describe some results of this search, particularly my experiences with libpoppler. Continue reading Parsing PDF for Fun And Profit (indeed in Python)