Some time ago I looked into building a parser for the simple logical language used by the imap_detach tool. Back then I used the parsimonious module, which is based on PEG grammars. Recently I learned about another parsing technique – parser combinators. The idea comes from functional programming: with the help of higher order functions (so called combinators) you combine simple parsing functions, which parse primitive tokens, into more complex parsing functions, which parse larger parts of the text, and you keep combining those until you end up with a single function that parses the whole input. I first encountered parser combinators in the Rust library nom. There, parsers and combinators are expressed as macros, which is convenient on one hand (concise syntax), but can lead to pretty cryptic error messages, and you cannot rely much on the editor's help with auto-completion, context help etc. I used nom to build a very simple URI parser. I was wondering whether parser combinators are also available for Python and how they would compare with other approaches – like the above mentioned PEG grammars.
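To make the idea concrete before getting to any library, here is a toy hand-rolled sketch in Python (everything in it is made up for illustration): a parser is a function from text to a (value, remaining text) pair, or None on failure, and a combinator builds a new parser out of existing ones:

def char(c):
    # parser for a single expected character
    def parse(text):
        return (c, text[1:]) if text.startswith(c) else None
    return parse

def sequence(p1, p2):
    # combinator: run p1, then p2 on the rest of the text, and pair up their values
    def parse(text):
        r1 = p1(text)
        if r1 is None:
            return None
        v1, rest = r1
        r2 = p2(rest)
        if r2 is None:
            return None
        v2, rest = r2
        return ((v1, v2), rest)
    return parse

ab = sequence(char('a'), char('b'))   # parser for the string "ab"
print(ab('abc'))                      # -> (('a', 'b'), 'c')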
Looking around I found parsec – a Python parser combinator library inspired by the well known Haskell library of the same name. Parsec is a small library written in about as much of a functional style as one can get out of Python. You can check its JSON parser example to get a feel for it.
In my test I wanted to create a parser for the same logical expressions as in the previous article – so we can parse and evaluate the following expressions into either a True or False value (assuming the names are replaced with actual values):
name ~=".pdf" & from ~= "jack" | ! attached
name ~=".pdf" & ( from ~= "jack" | from ~= "jim")
After playing a bit with the parsec library I started to construct parsers and hit an obvious problem (similar to my initial trials with parsimonious) – left recursion. A left-recursive rule like expression := expression '|' term calls itself before consuming any input, so a recursive descent parser would loop forever. There are advanced parser combinator libraries that can handle left recursion with the help of memoization, but Python parsec is a simple library, so combinators with left recursion have to be avoided – the usual trick is to rewrite such a rule as a term followed by zero or more operator/term repetitions. Below is the final code, which works similarly to the parsimonious grammar:
from parsec import *

parsing_ctx = {}
strict = False


class ContextError(Exception):
    pass


def ctx_lookup(name):
    try:
        return parsing_ctx[name]
    except KeyError:
        if strict:
            raise ContextError("Name %s not in context" % name)
        return ''


def skip(p):
    # return the value of parser p and eat any spaces following it
    return p << spaces()


left_bracket = skip(string('('))
right_bracket = skip(string(')'))
quote = string('"')
or_op = skip(string('|'))
and_op = skip(string('&'))
not_op = skip(string('!'))
contains_cmp = skip(string('~='))
eq_cmp = skip(string('='))
name = skip(regex('[a-z]+')).parsecmap(ctx_lookup)


@skip
@generate
def literal():
    yield quote
    text = yield many(none_of('"'))
    yield quote
    return ''.join(text)


@generate
def bracketed():
    yield left_bracket
    exp = yield expression
    yield right_bracket
    return exp


@generate
def equals():
    var = yield name
    yield eq_cmp
    val = yield literal
    return var == val


@generate
def contains():
    var = yield name
    yield contains_cmp
    val = yield literal
    return var.find(val) > -1


@generate
def not_exp():
    yield not_op
    val = yield simple ^ bracketed
    return not val


ops = or_op | and_op
simple = equals ^ contains ^ name
composed = simple ^ not_exp ^ bracketed


def reduce_ops(val1):
    @Parser
    def rp(text, index):
        curr_values = [val1]
        curr_index = index
        while True:
            res = ops(text, curr_index)
            if not res.status:
                # there is no further operation - stop parsing
                return Value.success(curr_index, any(curr_values))
            else:
                curr_index = res.index
                op = res.value
                res = composed(text, curr_index)
                if res.status:
                    if op == '&':
                        curr_values[-1] = curr_values[-1] and res.value
                    else:
                        curr_values.append(res.value)
                    curr_index = res.index
                else:
                    return res
    return rp


expression = composed.bind(reduce_ops)
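As a quick sanity check (the context values here are made up), parsing the two example expressions should then look roughly like this:

parsing_ctx.update({'name': 'report.pdf', 'from': 'jack@example.com', 'attached': ''})
print(expression.parse('name ~= ".pdf" & ( from ~= "jack" | from ~= "jim" )'))   # True
print(expression.parse('name ~= ".pdf" & from ~= "jack" | ! attached'))          # True
parsing_ctx['from'] = 'jim@example.com'
print(expression.parse('from ~= "jack" & name ~= ".pdf"'))                       # False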
Parsers in parsec are functions that take two arguments – the text to parse and an index where to start – and return a Value, which carries either the parsed value plus the index where the next parser can continue, or, in case of error, a description of the expected input. A combinator then takes parser functions as arguments and returns a new parser function that applies them in some useful combination – look for instance at the skip function above: it uses parsec's skip combinator, implemented as the overloaded << operator, which returns the value of the first parser only, while the second parser just eats characters from the input – in our case the non-significant spaces.
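A minimal illustration of the << combinator on its own (the input string here is just a made-up example):

from parsec import string, spaces

p = string('(') << spaces()   # parse '(' and then silently consume any whitespace
print(p.parse('(   '))        # -> '(' ; the spaces are eaten but not returned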
Parsec has one very cool feature – it can turn a Python generator into a parser. The generator yields parsers, which are applied one after another, and their results are sent back into the generator – so the final value can be calculated very easily; look for instance at the equals function.
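To see the mechanism in isolation, here is a small standalone sketch (a made-up assignment parser, not part of the grammar above):

from parsec import generate, regex, string

@generate
def assignment():
    # parse "<key>=<number>" and build the result value directly in the generator
    key = yield regex('[a-z]+')
    yield string('=')
    digits = yield regex('[0-9]+')
    return (key, int(digits))

print(assignment.parse('answer=42'))   # -> ('answer', 42)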
The final piece is the function reduce_ops, which creates a parser that parses zero or more binary operations (ands / ors) following the first operand and reduces them into one single value – ands first, as they have higher precedence than ors.
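Outside of parsec, the reduction itself boils down to a simple fold; here is a minimal sketch of the same logic, assuming the operands and operators have already been parsed into plain values:

def fold_ops(first, pairs):
    values = [first]
    for op, val in pairs:
        if op == '&':
            values[-1] = values[-1] and val   # '&' folds into the current disjunct
        else:
            values.append(val)                # '|' starts a new disjunct
    return any(values)

# True & False | True  ->  any([False, True])  ->  True
print(fold_ops(True, [('&', False), ('|', True)]))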
A big advantage of parser combinators is that they are built bottom up, and each parser can be tested independently, so you can write unit tests for the intermediate parsers to make sure they work correctly.
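For example, a few hypothetical unit-test style asserts against the intermediate parsers defined above (the context values are made up):

parsing_ctx.update({'name': 'report.pdf', 'from': 'jack@example.com'})
assert literal.parse('"jack"') == 'jack'
assert contains.parse('from ~= "jack"') is True
assert equals.parse('from = "jim"') is False
assert not_exp.parse('! ( from = "jim" )') is True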
I made a quick comparison of the performance of both parsers – the old one with the parsimonious grammar and the new one with parsec combinators (1000 iterations over the two simple expressions shown above):
Parser | Time (1000 iterations)
---|---
Old, with grammar compilation in each iteration | 4.830 secs
Old, with grammar compilation once before iterations | 0.563 secs
New (parsec) | 0.536 secs
As you can see, with parsimonious the grammar compilation takes significant time, but once compiled it is almost as fast as parsec. Concerning code size, both implementations are similar – roughly 100 lines of code.
Conclusions? Nothing in particular – parser combinators are fun, and you can try them yourself if you like functional programming. If the code is reasonably organized, they can still provide a comprehensible overview of the underlying “grammar”.