Writing Simple Parser in Python

From time to time one might need to write simple language parser to  implement  some domain specific language for his application.  As always python ecosystem offers  various solutions – overview of python parser generators is available here.  In this article I’d like to describe my experiences with parsimonious package. For recent project of mine ( imap_detach –  a tool to automatically download attachment from IMAP mailbox) I needed simple expressions to specify what emails and what exact parts should be downloaded.

Requirements

I needed a parser for simple logical expressions, which use a set of predefined variables ( properties of email), like these:

mime = "image/jpg" & attached & ! seen

Meaning: Email part is jpg image and is added as attachment and have not been seen yet

name ~=".pdf" & ( from ~= "jack" | from ~= "jim" )

Meaning all email part where filename contains .pdf and is from jack or jim

Grammar

Parsimonious implements PEG grammar which enables to create very compact grammar definitions. Unlike some other parser generators, where grammar in expressed in Python, parsimonious  has it’s own syntax, which enables to create short and easy to overview grammar definitions:

GRAMMAR=r""" # Test grammar
expr = space or space
or = and   more_or
more_or = ( space "|" space and )*
and = term  more_and
more_and = ( space "&" space term )*
term = not / value
not = "!" space value
value =  contains  / equals / bracketed / name
bracketed = "(" space expr space ")"
contains  =  name space "~="  space literal
equals =  name space "="  space literal
name       = ~"[a-z]+"
literal    = "\"" chars "\""
space    = " "*
chars = ~"[^\"]*"
"""

There are  couple of things which has to be remembered, when creating grammar:

  • PEG grammar should avoid left recursion, so rules like
    and = expr space "&" space expr

    Don not work and will result in recursion error ( infinite recursion).  They have to be rewritten as indirect left recursion, which might be sometime challenging.

  • PEG grammar grammar are more deterministic that context free grammars –  so there is always one rule that is matched – there is no ambiguity like in  CFG.   This is assured by deterministic selection  / – always first match is selected and by greedy repetition operators * + ?  (they always match maximum possible length from input).
    Practically this means that rules has to be more carefully designed and they need to be design in particular way to assure required priority of operators.
  • Only named rules can have some special treatment when walking AST tree ( see below).  I think this is special feature of parsimonious,  but AST contains  nodes for a part of rule – like expressions that are in brackets.   This means if I have rule like this:
    and = term ( space "&" space term )*

    I have no control on evaluating the right part, so I rather split it into two rules.

Evaluation

Once we have the grammar, we can evaluate expressions by walking the parsed AST. Parsimonious provides nice support for this with visitor pattern . We can create subclass of parsimonious.NodeVisitor with visit_rulename methods for all (relevant) rules from the grammar.   visit method receives current node in AST and list of values from its already visited (evaluated ) children.

We will need some supporting code (error class and function to decode binary strings to unicode):

class EvalError(Exception):
    def __init__(self, text, pos=0):
        super(EvalError, self).__init__(text+ ' at position %d'%pos)

def decode(s):
    if isinstance(s, six.binary_type):
        return s.decode('UTF-8')
    return s

 

So here is example of NodeVisitor class, that evaluates our simple expressions:

class SimpleEvaluator(parsimonious.NodeVisitor):
    def __init__(self, ctx, strict=True):
        self.grammar = parsimonious.Grammar(GRAMMAR)
        self._ctx = ctx
        self._strict = strict
    
    def visit_name(self, node, chidren):
        if node.text in self._ctx :
            val=self._ctx[node.text]
            if isinstance(val, (six.string_types)+ (six.binary_type,)) :
                val = decode(val).lower()
            return val
        elif self._strict:
            raise EvalError('Unknown variable %s'%node.text, node.start)
        else:
            return ''
    
    def visit_literal(self,node, children):
        return decode(children[1]).lower()
        
    def visit_chars(self, node, children):
        return node.text
    
    def binary(fn):  # @NoSelf
        def _inner(self, node, children):
            if isinstance(children[0], bool):
                raise EvalError('Variable is boolean, should not be used here %s'% node.text, node.start)
            return fn(self, node, children)
        return _inner
    
    @binary
    def visit_contains(self, node, children):
        return children[0].find(children[-1]) > -1
    
    @binary
    def visit_equals(self, node, children):
        return children[0] == children[-1]
   
    def visit_expr(self, node, children):
        return children[1]
    
    def visit_or(self, node, children):
        return children[0] or children[1] 
    
    def visit_more_or(self,node, children):
        return any(children)
    
    def visit_and(self, node, children):
        return children[0] and (True if children[1] is None else children[1])
    
    def visit_more_and(self, node, children):
        return all(children)
        
    def visit_not(self, node, children):
        return not children[-1]
    
    def visit_bracketed(self, node, children):
        return children[2]
    
    def generic_visit(self, node, children):
        if children:
            return children[-1]

This class evaluates expression within context of defined variable’s values (dictionary). So we can parse and evaluate expression with one method:

context = {"name": "test.pdf", 
           "mime": "application/pdf",
           "from": "jim@example.com",
           "to": "myself@example.com",
           "attached": True,
           "seen": False
            }

parser=SimpleEvaluator(context)
result= parser.parse('name ~=".pdf" & ( from ~= "jack" | from ~= "jim" )')

 Conclusion

Parsimonious is a nice compact package for creating small parsers that are easy to use. For myself I bit struggled with PEG grammar,  but that’s probably due to my unfamiliarity with this type of grammars.  Once accustomed to it one can create more complex grammars.

 

Leave a Reply

Your email address will not be published. Required fields are marked *