My professional work is focused on making the lives of engineers building software better. I got into that area because I was fascinated – or even obsessed – with programming languages. That led me to the more general area of engineering tools of all types. But programming languages and how they’re implemented remain one of my great fascinations.
If you’re trying to implement a new programming language (or re-implement an old one), the first thing you need to do is called parsing. Parsing consists of reading in a file containing a bunch of text, making sure that it’s correctly structured, and then using its structure to translate it into some kind of data structure.
For example, I implemented a silly programming language called Cruise. In Cruise, a snippet from a program looks like:
actor !Adder { behavior :Add() { on Plus(Z(),$x, $target) { send $x to $target } on Plus(I($x), $y, $target) { send Plus($x,I($y), $target) to $self } } initial :Add }
In the Cruise implementation, the parser reads that input, and translates it into a structure called an abstract syntax tree. For the code in the example above, a part of the syntax tree for that code looks something like the following:
Actor[ name="actor" behaviors=[ Behavior[ pattern=Tuple[ name="Plus", Tuple[name="Z"], Var[name="x"], Var[name="target"]], action=[Send[Var["x"], Var["Target"]], Behavior[ pattern=Tuple[ name="Plus", Tuple[name="I", Var["x"]], Var["y"], Var["target"]], action=[ Send[ Tuple[name="Plus", Var["x"], Tuple[name="I", Var["y"]]], Var["target"]]]]]]
When you want to write a parser, you need to start by describing
what you want to parse. Most of the time, we do that using
a context free grammar (CFG). A CFG is a set of rules that
describe how to produce texts with a particular structure. For
example, suppose you wanted to implement a simple calculator (one
of the canonical examples used when teaching parsing!). You’d
need to describe what the expressions you want to parse
look like. A pretty typical version of this would be:
expr := add_expr add_expr := mult_expr ( ( '+' | '-' ) mult_expr)* mult_expr := unary_expr ( ('*' | '/') unary_expr)* unary_expr := '-' simple | simple simple := number | '(' expr ')' number: digit+
This grammar is a way of expressing a system for producing valid expressions. The easiest way to see how it works is to look at an example. We know that something like “3 + 4 * (-5 + 6)” is an expression. So how would be produce that using the grammar?
- Start with expr. That always expands to add_expr.
- Expand add_expr to mult_expr ‘+’ mult_expr.
- Expand the first instance of mult_expr to unary_expr, giving us “unary_expr+mult_expr“.
- Expand the first unary expr to simple: “simple+mult_expr“.
- Expand simple to number: “number+mult_expr“.
- Expand number to the number 3: “3+mult_expr“.
- Expand the mult_expr to “unary_expr*unary_expr“: “3+unary_expr*unary_expr“.
- Expand the first unary_expr to simple, and simple to number, and number to “4”: “3+4*unary_expr“.
- Expand the remaining unary_expr to simple, and simple to a parenthesized expression: “3+4*(expr)”.
- And so on, until we end up with “3+4*(-5+6)”.
When I started learning to write compilers, there were two ways that people wrote parsers. They either implemented them manually, using a technique called recursive descent, or they used a table-driven parser generator like yacc, where you wrote the grammar using a special syntax, and then a tool translated it into the implementation of a parser.
Writing a parser using recursive descent was a huge pain. It’s very tedious and repetitive, and like any meticulous process, it’s very error-prone. In short, it’s no damn fun at all.
Writing a parser using a generator was usually much easier – but if something didn’t work, it would be a huge pain. The problem is that the code you write doesn’t end up looking very much like the code you run. For example, to write our expression language, in a typical parser generator, you’d write it something like the follows. (The grammar gets re-arranged a bit, to fit the particular underlying parser algorithm that the generator uses).
%start expr expr: add_expr; add_expr: add_expr '+' mult_expr | add_expr '-' mult_expr | mult_expr ; mult_expr: mult_expr '*' unary_expr | mult_expr '/' unary_expr | unary_expr ; unary_expr: '-' simple | simple ; simple: number | '(' expr ')' ;
The parser generator reads the grammar that we wrote, and translates it into a set of lookup tables, which get interpreted by semi-generic parser code provided by the generator toolkit. What that means is that when your parser actually executes, there’s nothing called unary_expr. There’s some set of states encoded into magic variables with names like yytable
and yytrans
. If you got something wrong in your parser, debugging it is painful at best, because the grammar is gone: it’s been translated into something completely incomprehensible to human eyes.
Not all parser generators generate code that’s this extreme when it comes to readability. But it’s the output from the most widely used one, called YACC, which has been translated and used by C/C++, Caml, Go, Fortran, Pascal, D, SML, and more.
But even if you’re not using YACC, it’s not a good scene. Even ANTLR, my favorite parser generator, which makes a heroic effort to generate efficient but human-readable parsers, the fact remains that the code that you’re running and trying to debug isn’t familiar to you: it’s not what you wrote. You’re still stuck trying to work through an unfamiliar translation.
Honestly, both approaches – the manual parser, and the parser generator – suck an awful lot of the time.
Then, a few years ago, a gang of functional programmers came up with a truly brilliant scheme. You can describe a parser as a special kind of function from inputs to outputs. If you do that, you can also combine parsers using higher-order functions. Using those functions, all you need to do is write tiny parsers for the simplest elements of your language, and then you can build the parser for your language by just combining your simple parsers in different ways. When you implement a parser this way, what you get is very close to the original grammar of the language, but it’s an actual executable program: the code that you need to read and debug is exactly the code that you wrote.
For example, suppose I wanted to parse that expression language. Using a Python version of parser combinators, I can write:
digit = Parser.match(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']) number = digit.many(1) parens = Parser.match(['(']) & Reference('expr') & Parser.match([')']) simple = number | parens unary_expr = Parser.match(['-']).opt() & simple mult_expr = unary_expr & (Parser.match(['*', '/']) & unary_expr).many() add_expr = mult_expr & (Parser.match(['-', '+']) & mult_expr).many() expr = add_expr Reference.register_named_parser('expr', add_expr)
That’s not a contrived example. That’s actually a real, honest-to-god parser for our expression language! We’ll look at it in more detail later. But first, we need to understand what parsers and parser combinators really are, and how they work.
At this point, it’s time to start getting more concrete. Just what is a parser, precisely?
In keeping with the fact that this approach came from the functional programming community, it makes sense to describe a parser as a function. If you think of it this way, a parser is a special kind of function:
- It takes a sequence of inputs;
- It can either succeed or fail;
- If it succeeds, it consumes part of its input, and produces an output.
That means that we can come up with a sort-of type signature pretty easily. It’s a function from an input to a result value. The result value can be two types: either a success, or a failure. If it’s a success, it contains both the output value produced by the parser and the unconsumed part of the input. If it fails, the failure contains nothing.
In Python, that’s easy to represent in several different ways. We’ll do it in the one that’s closest to a typed approach: an abstract class ParseResult
, with subclasses for success or failure.
class ParseResult(object): def succeeded(self): return False class Success(ParseResult): """A successful parse result, containing the value produced by a parser, and the unconsumed part of its input. """ def __init__(self, output, rest): self.output = output self.rest = rest def succeeded(self): return True class Failure(ParseResult): def __init__(self): pass
The inputs are just a sequence of values. The parsers can look at one element of the input at a time. In Python, we’ll represent them using a class.
class ParserInput(object): @abstractmethod def at_end(self): """Return True if there's no remaining input.""" def first(self): """Return the first element of this input.""" def rest(self): """Return a ParserInput consisting of the remainder after the first element has been consumed. """
With those definitions, a parser becomes an object with one method, which takes a ParserInput
, and produces a ParseResult
.
class Parser(object): def parse(self, inp): """inp should be an instance of ParserInput; returns an instance of ParseResult"""
Before we get to the combinators, let’s look at an example of a simple parser. Suppose we want to accept a digit. We can implement that as:
class DigitParser(Parser): def parse(self, inp): if inp.first().isdigit(): return Success(inp.first(), inp.rest()) else: return Failure()
This looks at the first character of the input stream. This is mostly functional, so it does this without altering the input stream. If the first character is a digit, then it succeeds, returning the digit character and the remaining input; otherwise, it fails.
Simple code like that little snippet are all you need to write yourself. It’s so incredibly simple that looking at it, it seems like it’s almost too simple. How on earth can anything this trivially simple actually be the basis of something as powerful as the infamous YACC?
The answer is combinators: functions that combine parsers togethers. This simple basic, functional mechanism makes it possible to combine simple parsers using simple functions – producing a limitless variety of parsers, which can parse pretty much anything you choose to throw at them. So what kinds of combinators do we need? There are four basics that I used in my implementations, and one that’s an artifact of the implementation.
- Sequence: Parse an “A” followed by a “B”. In my Python implementation, this is written
A & B
. It succeeds only if both “A” and “B” succeed in sequence. The output of this is a list containing the outputs from “A”, “B”, and whatever else is in the sequence with them. - Choice: Parse either an “A” or if that fails, parse a “B”. That’s written
A | B
. The output is whichever one succeeds, or else fails. - Option: Parse either an “A”, or nothing, written
a.opt()
. The output is either the output from parsing an “A”, orNone
. - Repetition: Parse more than one “A”, written
a.many(min)
. The result is a list of the outputs of the invocations of “A”. - Reference: I’m going to tell you what kind of thing to parse here later. For now, it’s just named “A”. The output is, of course, whatever the referenced parser outputs.
The last one definitely needs some extra explanation. Most interesting grammars are, ultimately, recursive. In our calculator’s expression grammar, the rule expr contains an add_expr, which contains a mult_expr, which contains a unary_expr, which contains a simple, which could expand to something that contains an expr. That leads to a problem constructing our parsers: we can’t actually build an instance of an expr, because it needs to contain itself. The Reference combinator is the solution to this: it lets you say Call the parser named “A”; I’ll tell you what parser that is later. If you look at the parser implementation I showed earlier, you can see this:
parens = Parser.match(['(']) & Reference('expr') & Parser.match([')']) ... Reference.register_named_parser('expr', add_expr)
In the definition of the parens rule, we used Reference('expr')
, which says “call whatever parser is registered with the name ‘expr'”. Then later on, we registered the add_expr
parser with the name ‘expr’. Voila! Recursion.
There’s one more thing that I used in the calculator example: Parser.match(). That’s just a helpful little piece of code that I wrote to create parsers that look for one character from a fixed set. It’s implemented with a class named SetParser
. Parser.match(['('])
will only match a “(” in the input; Parser.match(['a', 'b', 'c'])
will match either an “a”, a “b”, or a “c”. Implementing it is as easy as you’d guess at this point:
class SetParser(Parser): def __init__(self, chars): self.chars = chars def pr(self): return "SetParser%s" % self.chars def parse(self, inp): if inp.first() in self.chars: return Success(inp.first(), inp.rest()) else: return Failure()
With these, you should now be able to completely read and understand the implementation of our expression parser. Let’s walk through it.
digit = Parser.match(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']) number = digit.many(1) parens = Parser.match(['(']) & Reference('expr') & Parser.match([')']) simple = number | parens unary_expr = Parser.match(['-']).opt() & simple mult_expr = unary_expr & (Parser.match(['*', '/']) & unary_expr).many() add_expr = mult_expr & (Parser.match(['-', '+']) & mult_expr).many() expr = add_expr Reference.register_named_parser('expr', add_expr)
- digit: this is a SetParser that reads a single digit.
- number: a number is just a sequence of at least one digit.
- parens: this is a parenthesized expression: an open paren followed by an expression, followed by a close paren.
- simple: a simple expression is either a number, or a parenthesized expression.
- unary: a unary expression is an optional “-“, followed by a simple.
- mult_expr: a multiplication expression is a unary, followed by a collection of multiplication operators and operands.
- add_expr: the same thing as a
mult_expr
, except that it uses add operations. By defining them seperately, with multiplication embedded inside addition, we gave multiplication a higher precedence. - expr is just an alias for
add_expr
. - Reference: finally, we bind the reference from the parenthesized expression.
Now that we have a parser, we can run it!
>>> inp = StringParserInput("1+2*(3+5*4)*(6+7)") >>> print(expr.parse(inp).output) [None, ['1'], [], [['+', [None, ['2'], [['*', [None, ['(', [None, ['3'], [], [['+', [None, ['5'], [['*', [None, ['4']]]]]]]], ')']]], ['*', [None, ['(', [None, ['6'], [], [['+', [None, ['7'], []]]]], ')']]]]]]]]
It works, but wow, that output is horrible! It’s basically a tree, formed from lists, that describes the parse process. It’s got all of the information we’d need to evaluate the expressions parsed by our grammar, only it’s in a terrible form.
That brings us to one last thing to add: Actions. In terms of parser combinators, an action is a parser that takes another parser, and a function. The output of the action is the result of applying the function to the embedded parser:
class Action(Parser): def __init__(self, parser, act): self.parser = parser self.action = act def pr(self): return "Action[%s]" % self.parser.pr() def parse(self, inp): result = self.parser.parse(inp) if result.succeeded(): return Success(self.action(result.output), result.rest) else: return Failure()
Using that, we can turn our expression parser into a calculator!
def digits_to_number(digits, running=0): """Convert a list of digits to an integer""" if len(digits) == 0: return running else: r = (running * 10) + int(digits[0]) return digits_to_number(digits[1:], r) digit = Parser.match(['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']) number = Action(digit.many(1), digits_to_number)
In this version, number
now actually outputs an integer, which is the number represented by the list of digits produced by the parser.
We can attach similar actions to all of the other rules. For example, the new version of our addition expression is:
def eval_add(lst): """Evaluate an addition expression. For addition rules, the parser will return [number, [[op, number], [op, number], ...]] To evaluate that, we start with the first element of the list as result value, and then we iterate over the pairs that make up the rest of the list, adding or subtracting depending on the operator. """ first = lst[0] result = first for n in lst[1]: if n[0] == '+': result += n[1] else: result -= n[1] return result add_expr = Action(mult_expr & (Parser.match(['-', '+']) & mult_expr).many(), eval_add)
With similar actions attached to the others, now, it works. (The code for the other actions is in the github repository, linked below):
>>> inp = StringParserInput("1+2*(3+5*4)*(6+7)") >>> print(expr.parse(inp).output) 599
That’s about it for doing this in Python. The code is all here on Github, in the “python” subdirectory. The rest of the github project is the Java version. It’s more complicated to do all of this in Java, because of the type system. We’ll take a look at that in the next post. The syntax is more verbose, and the types make things a bit more complicated, but it’s still amazingly easy to build a useful parser this way.
This implementation isn’t quite what I’d call industrial strength:
- It’s not super-efficient; particularly when it’s parsing a multiple-precedence expression, it can do a whole lot of backtracking, which makes things very slow. To fix this, the best thing would be to add something called operator precedence parsing for handling infix expressions. Parsec, the Haskell library that introduced my to combinator parsing, does this, and it workes beautifully. It would be nice to add that, but for my purposes (ie, writing this blog post, plus a bit of language parsing for my Apex project), it’s not necessary.
- It has terrible error handling. In fact, it doesn’t have error handling at all. If there’s a syntax error in your input, all that will happen is that the parser will fail. To fix this, I’ll need to add a third kind of return value to the basic parser function: a parser should be able to succeed, fail without error, and fail with an error. This is something I really need to add if I ever want to use this code for anything even semi-real.
- It doesn’t track positions. In other words, when I get it to the point where it can actually produce a syntax error, it can’t say where the error is. I’d really like to be able to say “There’s an error at line 3 column 7”, but right now, there’s no way to do that. This one is relatively easy to add: the parser input just needs to be able to report a line and column number. I’m definitely planning on adding that.