Boost C++ Libraries Home Libraries People FAQ More

PrevUpHomeNext

Parsing Expression Grammar

Parsing Expression Grammars (PEG) [5] are a derivative of Extended Backus-Naur Form (EBNF) [6] with a different interpretation, designed to represent a recursive descent parser. A PEG can be directly represented as a recursive-descent parser.

Like EBNF, PEG is a formal grammar for describing a formal language in terms of a set of rules used to recognize strings of this language. Unlike EBNF, PEGs have an exact interpretation. There is only one valid parse tree (see Abstract Syntax Tree) for each PEG grammar.

Sequences

Sequences are represented by juxtaposition like in EBNF:

a b

The PEG expression above states that, in order for this to succeed, b must follow a. Here's the syntax diagram:

sequence

Here's a trivial example:

'x' digit

which means the character x must be followed by a digit.

[Note] Note

In Spirit.X3, we use the >> for sequences since C++ does not allow juxtaposition.

Alternatives

Alternatives are represented in PEG using the slash:

a / b
[Note] Note

In Spirit.X3, we use the | for alternatives just as in EBNF.

Alternatives allow for choices. The expression above reads: try to match a. If a succeeds, success, if not try to match b. This is a bit of a deviation from the usual EBNF interpretation where you simply match a or b. Here's the syntax diagram:

alternative

PEGs allow for ambiguity in the alternatives. In the expression above, both a or b can both match an input string. However, only the first matching alternative is valid. As noted, there can only be one valid parse tree.

Loops

Again, like EBNF, PEG uses the regular-expression Kleene star and the plus loops:

a*
a+
[Note] Note

Spirit.X3 uses the prefix star and plus since there is no postfix star or plus in C++.

Here are the syntax diagrams:

kleene

plus

The first, called the Kleene star, matches zero or more of its subject a. The second, plus, matches one ore more of its subject a.

Unlike EBNF, PEGs have greedy loops. It will match as much as it can until its subject fails to match without regard to what follows. The following is a classic example of a fairly common EBNF/regex expression failing to match in PEG:

alnum* digit

In PEG, alnum will eat as much alpha-numeric characters as it can leaving nothing more left behind. Thus, the trailing digit will get nothing. Loops are simply implemented in recursive descent code as for/while loops making them extremely efficient. That is a definite advantage. On the other hand, those who are familiar with EBNF and regex behavior might find the behavior a major gotcha. PEG provides a couple of other mechanisms to circumvent this. We will see more of these other mechanisms shortly.

Difference

In some cases, you may want to restrict a certain expression. You can think of a PEG expression as a match for a potentially infinite set of strings. The difference operator allows you to restrict this set:

a - b

The expression reads: match a but not b.



[5] Bryan Ford: Parsing Expression Grammars: A Recognition-Based Syntactic Foundation, http://pdos.csail.mit.edu/~baford/packrat/popl04/

[6] Richard E. Pattis: EBNF: A Notation to Describe Syntax, http://www.cs.cmu.edu/~pattis/misc/ebnf.pdf


PrevUpHomeNext