Home | Libraries | People | FAQ | More |
Parsing Expression Grammars (PEG) [5] are a derivative of Extended Backus-Naur Form (EBNF) [6] with a different interpretation, designed to represent a recursive descent parser. A PEG can be directly represented as a recursive-descent parser.
Like EBNF, PEG is a formal grammar for describing a formal language in terms of a set of rules used to recognize strings of this language. Unlike EBNF, PEGs have an exact interpretation. There is only one valid parse tree (see Abstract Syntax Tree) for each PEG grammar.
Sequences are represented by juxtaposition like in EBNF:
a b
The PEG expression above states that, in order for this to succeed, b
must follow a
.
Here's the syntax diagram:
Here's a trivial example:
'x' digit
which means the character x
must be followed by a digit.
Note | |
---|---|
In Spirit.X3, we use the |
Alternatives are represented in PEG using the slash:
a / b
Note | |
---|---|
In Spirit.X3, we use the |
Alternatives allow for choices. The expression above reads: try to match
a
. If a
succeeds, success, if not try to match b
.
This is a bit of a deviation from the usual EBNF interpretation where you
simply match a
or b
.
Here's the syntax diagram:
PEGs allow for ambiguity in the alternatives. In the expression above, both
a
or b
can both match an input string. However, only the first matching alternative
is valid. As noted, there can only be one valid parse tree.
Again, like EBNF, PEG uses the regular-expression Kleene star and the plus loops:
a* a+
Note | |
---|---|
Spirit.X3 uses the prefix star and plus since there is no postfix star or plus in C++. |
Here are the syntax diagrams:
The first, called the Kleene star, matches zero or more of its subject a
. The second, plus, matches one ore more
of its subject a
.
Unlike EBNF, PEGs have greedy loops. It will match as much as it can until its subject fails to match without regard to what follows. The following is a classic example of a fairly common EBNF/regex expression failing to match in PEG:
alnum* digit
In PEG, alnum will eat as much alpha-numeric characters as it can leaving nothing more left behind. Thus, the trailing digit will get nothing. Loops are simply implemented in recursive descent code as for/while loops making them extremely efficient. That is a definite advantage. On the other hand, those who are familiar with EBNF and regex behavior might find the behavior a major gotcha. PEG provides a couple of other mechanisms to circumvent this. We will see more of these other mechanisms shortly.
In some cases, you may want to restrict a certain expression. You can think of a PEG expression as a match for a potentially infinite set of strings. The difference operator allows you to restrict this set:
a - b
The expression reads: match a
but not b
.
[5] Bryan Ford: Parsing Expression Grammars: A Recognition-Based Syntactic Foundation, http://pdos.csail.mit.edu/~baford/packrat/popl04/
[6] Richard E. Pattis: EBNF: A Notation to Describe Syntax, http://www.cs.cmu.edu/~pattis/misc/ebnf.pdf