The Different Approach to A Parser – Parser Combinators

Some time ago I’ve looked into building parser for simple logical language used by imap_detach Tool. I used there parsimonious module which uses PEG grammars. Recently I learned about another parsing technique – Parser Combinators. The idea comes from functional programming and is about combining (with help of higher order functions – so called combinators) simple parsing functions, which parse primitive tokens, into more complex parsing functions, which parse parts of text and further combine those into more complex ones, ending finally with one function, which parses all given data. I first met with parser combinators in Rust library nom. Here parsers and  combinators are express as macros, which is convenient on one side (concise syntax), but can lead to pretty cryptic error messages and one cannot rely much on editor’s help with auto-completions, context help etc..  I used nom to build very simple URI parser.  I was wondering also if parser combinators are available for Python and how they would compare with other approaches – like above mentioned PEG grammars.

Looking around I found parsec – Python parser combinators library inspired by well know Haskell library.  Parsec is a small library with as much of functional style one can get out of Python.  You can check example for JSON parser.

In my test I wanted to create a parser for same logical expressions as in previous  article – so we can parse and evaluate following expressions into either True or False value (assuming names are replaced with actual values):

name ~=".pdf" & from ~= "jack" | ! attached
name ~=".pdf" & ( from ~= "jack" | from ~= "jim")


After playing a bit with the parsec library I started to construct parsers and hit obvious problem (similarly as in initial trials with parsimonious) –  left recursion – though there are some advanced parser combinators that can handle left recursion with help of memoization,  Python parsec is a simple library, so combinators with left recursion have to avoided.  Below is the final code that works similarly to parsimonious grammar:

from parsec import *

parsing_ctx = {}
strict = False

class ContextError(Exception):

def ctx_lookup(name):
        return parsing_ctx[name]
    except KeyError:
        if strict:
            raise ContextError("Name %s not is context" % name)
        return ''

def skip(p): return p << spaces()

left_bracket = skip(string('('))
right_bracket = skip(string(')'))
quote = string('"')
or_op = skip(string('|'))
and_op = skip(string('&'))
not_op = skip(string('!'))
contains_cmp = skip(string('~='))
eq_cmp = skip(string('='))
name = skip(regex('[a-z]+')).parsecmap(ctx_lookup)

def literal():
    yield quote
    text = yield many(none_of('"'))
    yield quote
    return ''.join(text)

def bracketed():
    yield left_bracket
    exp = yield expression
    yield right_bracket
    return exp

def equals():
    var = yield name
    yield eq_cmp
    val = yield literal
    return var == val

def contains():
    var = yield name
    yield contains_cmp
    val = yield literal
    return var.find(val) > -1

def not_exp():
    yield not_op
    val = yield simple ^ bracketed
    return not val

ops = or_op | and_op
simple = equals ^ contains ^ name
composed = simple ^ not_exp ^ bracketed

def reduce_ops(val1):
    def rp(text, index):
        curr_values = [val1]
        curr_index = index
        while True:
            res = ops(text, curr_index)
            if not res.status:
                # there is no further operation - stop parsing
                return Value.success(curr_index, any(curr_values))
                curr_index = res.index
                op = res.value
                res = composed(text, curr_index)
                if res.status:
                    if op =='&':
                        curr_values[-1] = curr_values[-1] and res.value
                    curr_index = res.index
                    return res
    return rp

expression = composed.bind(reduce_ops)

Parsers in parsec are functions which take two arguments – index where to start and text to parse – and they return Value type – which contains either parsed value and index where next parser can continue or, in case of error, some description of expected input.  Combinator then takes parser functions as arguments and returns a new parser function, which applies parsers in some useful combination –  look for instance at skip function – it uses combinator skip implemented as  << (overloaded operator ) – it returns value only from first parser and second parser just eats characters from input – in our case non-significant spaces.

Parsec has one very cool feature –  it can turn Python generator into parser. Generator must yields parsers, which are subsequently applied and their value is returned  back to generator – thus final value can be calculated very easily – look for instance to equals functions.

Final piece is function reduce_ops, which creates a parser that parses zero or more following binary operations (ands / ors) and reduces their operands into one single value – first ands as they have higher precedence then ors.

Big advantage of parser combinators is that they are built from bottom up and each parser can be tested independently so you can make unit tests for intermediate parsers to assure that they are working correctly.

I made quick comparison of performance for both parsers – old with parsimonious grammar and new with parsec combinators (1000 iterations of two simple expressions shown above):

Old with grammar compilation in each iteration  4.830 secs
Old with grammar compilation once before iterations 0.563 secs
New 0.536 secs

as you can see in parsimonious grammar compilation takes significant time, but once compiled it is almost as fast as parsec. Concerning code size both examples were similar in size ~ 100 lines of code.

Conclusions? Nothing particular – parser combinators are fun, you can try them yourself, if you like functional programing. If code is reasonably organized they can still provide comprehensible overview of the underlying “grammar”.

Leave a Reply

Your email address will not be published. Required fields are marked *