The Different Approach to A Parser – Parser Combinators

Some time ago I’ve looked into building parser for simple logical language used by imap_detach Tool. I used there parsimonious module which uses PEG grammars. Recently I learned about another parsing technique – Parser Combinators. The idea comes from functional programming and is about combining (with help of higher order functions – so called combinators) simple parsing functions, which parse primitive tokens, into more complex parsing functions, which parse parts of text and further combine those into more complex ones, ending finally with one function, which parses all given data. I first met with parser combinators in Rust library nom. Here parsers and  combinators are express as macros, which is convenient on one side (concise syntax), but can lead to pretty cryptic error messages and one cannot rely much on editor’s help with auto-completions, context help etc..  I used nom to build very simple URI parser.  I was wondering also if parser combinators are available for Python and how they would compare with other approaches – like above mentioned PEG grammars.

Looking around I found parsec – Python parser combinators library inspired by well know Haskell library.  Parsec is a small library with as much of functional style one can get out of Python.  You can check example for JSON parser.

In my test I wanted to create a parser for same logical expressions as in previous  article – so we can parse and evaluate following expressions into either True or False value (assuming names are replaced with actual values):

name ~=".pdf" & from ~= "jack" | ! attached
name ~=".pdf" & ( from ~= "jack" | from ~= "jim")

 

After playing a bit with the parsec library I started to construct parsers and hit obvious problem (similarly as in initial trials with parsimonious) –  left recursion – though there are some advanced parser combinators that can handle left recursion with help of memoization,  Python parsec is a simple library, so combinators with left recursion have to avoided.  Below is the final code that works similarly to parsimonious grammar:

Parsers in parsec are functions which take two arguments – index where to start and text to parse – and they return Value type – which contains either parsed value and index where next parser can continue or, in case of error, some description of expected input.  Combinator then takes parser functions as arguments and returns a new parser function, which applies parsers in some useful combination –  look for instance at skip function – it uses combinator skip implemented as  << (overloaded operator ) – it returns value only from first parser and second parser just eats characters from input – in our case non-significant spaces.

Parsec has one very cool feature –  it can turn Python generator into parser. Generator must yields parsers, which are subsequently applied and their value is returned  back to generator – thus final value can be calculated very easily – look for instance to equals functions.

Final piece is function reduce_ops, which creates a parser that parses zero or more following binary operations (ands / ors) and reduces their operands into one single value – first ands as they have higher precedence then ors.

Big advantage of parser combinators is that they are built from bottom up and each parser can be tested independently so you can make unit tests for intermediate parsers to assure that they are working correctly.

I made quick comparison of performance for both parsers – old with parsimonious grammar and new with parsec combinators (1000 iterations of two simple expressions shown above):

Old with grammar compilation in each iteration  4.830 secs
Old with grammar compilation once before iterations 0.563 secs
New 0.536 secs

as you can see in parsimonious grammar compilation takes significant time, but once compiled it is almost as fast as parsec. Concerning code size both examples were similar in size ~ 100 lines of code.

Conclusions? Nothing particular – parser combinators are fun, you can try them yourself, if you like functional programing. If code is reasonably organized they can still provide comprehensible overview of the underlying “grammar”.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">