NAME¶
peg, leg - parser generators
SYNOPSIS¶
peg [-hvV -ooutput] [filename ...]
leg [-hvV -ooutput] [filename ...]
DESCRIPTION¶
peg and
leg are tools for generating recursive-descent parsers:
programs that perform pattern matching on text. They process a Parsing
Expression Grammar (PEG) [Ford 2004] to produce a program that recognises
legal sentences of that grammar.
peg processes PEGs written using the
original syntax described by Ford;
leg processes PEGs written using
slightly different syntax and conventions that are intended to make it an
attractive replacement for parsers built with
lex(1) and
yacc(1). Unlike
lex and
yacc,
peg and
leg
support unlimited backtracking, provide ordered choice as a means for
disambiguation, and can combine scanning (lexical analysis) and parsing
(syntactic analysis) into a single activity.
peg reads the specified
filenames, or standard input if no
filenames are given, for a grammar describing the parser to generate.
peg then generates a C source file that defines a function
yyparse(). This C source file can be included in, or compiled and then
linked with, a client program. Each time the client program calls
yyparse() the parser consumes input text according to the parsing
rules, starting from the first rule in the grammar.
yyparse() returns
non-zero if the input could be parsed according to the grammar; it returns
zero if the input could not be parsed.
The prefix 'yy' or 'YY' is prepended to all externally-visible symbols in the
generated parser. This is intended to reduce the risk of namespace pollution
in client programs. (The choice of 'yy' is historical; see
lex(1) and
yacc(1), for example.)
OPTIONS¶
peg and
leg provide the following options:
- -h
- prints a summary of available options and then exits.
- -ooutput
- writes the generated parser to the file output instead of the
standard output.
- -v
- writes verbose information to standard error while working.
- -V
- writes version information to standard error then exits.
A SIMPLE EXAMPLE¶
The following
peg input specifies a grammar with a single rule (called
'start') that is satisfied when the input contains the string
"username".
start <- "username"
(The quotation marks are
not part of the matched text; they serve to
indicate a literal string to be matched.) In other words,
yyparse() in
the generated C source will return non-zero only if the next eight characters
read from the input spell the word "username". If the input contains
anything else,
yyparse() returns zero and no input will have been
consumed. (Subsequent calls to
yyparse() will also return zero, since
the parser is effectively blocked looking for the string
"username".) To ensure progress we can add an alternative clause to
the 'start' rule that will match any single character if "username"
is not found.
start <- "username"
/ .
yyparse() now always returns non-zero (except at the very end of the
input). To do something useful we can add actions to the rules. These actions
are performed after a complete match is found (starting from the first rule)
and are chosen according to the 'path' taken through the grammar to match the
input. (Linguists would call this path a 'phrase marker'.)
start <- "username" { printf("%s\n", getlogin()); }
/ < . > { putchar(yytext[0]); }
The first line instructs the parser to print the user's login name whenever it
sees "username" in the input. If that match fails, the second line
tells the parser to echo the next character on the input the standard output.
Our parser is now performing useful work: it will copy the input to the
output, replacing all occurrences of "username" with the user's
account name.
Note the angle brackets ('<' and '>') that were added to the second
alternative. These have no effect on the meaning of the rule, but serve to
delimit the text made available to the following action in the variable
yytext.
If the above grammar is placed in the file
username.peg, running the
command
peg -o username.c username.peg
will save the corresponding parser in the file
username.c. To create a
complete program this parser could be included by a C program as follows.
#include <stdio.h> /* printf(), putchar() */
#include <unistd.h> /* getlogin() */
#include "username.c" /* yyparse() */
int main()
{
while (yyparse()) /* repeat until EOF */
;
return 0;
}
PEG GRAMMARS¶
A grammar consists of a set of named rules.
name <- pattern
The
pattern contains one or more of the following elements.
- name
- The element stands for the entire pattern in the rule with the given
name.
- "characters"
- A character or string enclosed in double quotes is matched literally. The
ANSI C escape sequences are recognised within the characters.
- 'characters'
- A character or string enclosed in single quotes is matched literally, as
above.
- [characters]
- A set of characters enclosed in square brackets matches any single
character from the set, with escape characters recognised as above. If the
set begins with an uparrow (^) then the set is negated (the element
matches any character not in the set). Any pair of characters
separated with a dash (-) represents the range of characters from the
first to the second, inclusive. A single alphabetic character or
underscore is matched by the following set.
[a-zA-Z_]
Similarly, the following matches any single non-digit character.
[^0-9]
- .
- A dot matches any character. Note that the only time this fails is at the
end of file, where there is no character to match.
- ( pattern )
- Parentheses are used for grouping (modifying the precedence of the
operators described below).
- { action }
- Curly braces surround actions. The action is arbitrary C source code to be
executed at the end of matching. Any braces within the action must be
properly nested. Any input text that was matched before the action and
delimited by angle brackets (see below) is made available within the
action as the contents of the character array yytext. The length of
(number of characters in) yytext is available in the variable
yyleng. (These variable names are historical; see
lex(1).)
- <
- An opening angle bracket always matches (consuming no input) and causes
the parser to begin accumulating matched text. This text will be made
available to actions in the variable yytext.
- >
- A closing angle bracket always matches (consuming no input) and causes the
parser to stop accumulating text for yytext.
The above
elements can be made optional and/or repeatable with the
following suffixes:
- element ?
- The element is optional. If present on the input, it is consumed and the
match succeeds. If not present on the input, no text is consumed and the
match succeeds anyway.
- element +
- The element is repeatable. If present on the input, one or more
occurrences of element are consumed and the match succeeds. If no
occurrences of element are present on the input, the match
fails.
- element *
- The element is optional and repeatable. If present on the input, one or
more occurrences of element are consumed and the match succeeds. If
no occurrences of element are present on the input, the match
succeeds anyway.
The above elements and suffixes can be converted into predicates (that match
arbitrary input text and subsequently succeed or fail
without consuming
that input) with the following prefixes:
- & element
- The predicate succeeds only if element can be matched. Input text
scanned while matching element is not consumed from the input and
remains available for subsequent matching.
- ! element
- The predicate succeeds only if element cannot be matched. Input
text scanned while matching element is not consumed from the input
and remains available for subsequent matching. A popular idiom is
!.
which matches the end of file, after the last character of the input has
already been consumed.
A special form of the '&' predicate is provided:
- &{ expression }
- In this predicate the simple C expression (not statement) is
evaluated immediately when the parser reaches the predicate. If the
expression yields non-zero (true) the 'match' succeeds and the
parser continues with the next element in the pattern. If the
expression yields zero (false) the 'match' fails and the parser
backs up to look for an alternative parse of the input.
Several elements (with or without prefixes and suffixes) can be combined into a
sequence by writing them one after the other. The entire sequence
matches only if each individual element within it matches, from left to right.
Sequences can be separated into disjoint alternatives by the alternation
operator '/'.
- sequence-1 / sequence-2 / ... / sequence-N
- Each sequence is tried in turn until one of them matches, at which time
matching for the overall pattern succeeds. If none of the sequences
matches then the match of the overall pattern fails.
Finally, the pound sign (#) introduces a comment (discarded) that continues
until the end of the line.
To summarise the above, the parser tries to match the input text against a
pattern containing literals, names (representing other rules), and various
operators (written as prefixes, suffixes, juxtaposition for sequencing and and
infix alternation operator) that modify how the elements within the pattern
are matched. Matches are made from left to right, 'descending' into named
sub-rules as they are encountered. If the matching process fails, the parser
'back tracks' ('rewinding' the input appropriately in the process) to find the
nearest alternative 'path' through the grammar. In other words the parser
performs a depth-first, left-to-right search for the first
successfully-matching path through the rules. If found, the actions along the
successful path are executed (in the order they were encountered).
Note that predicates are evaluated
immediately during the search for a
successful match, since they contribute to the success or failure of the
search. Actions, however, are evaluated only after a successful match has been
found.
PEG GRAMMAR FOR PEG GRAMMARS¶
The grammar for
peg grammars is shown below. This will both illustrate
and formalise the above description.
Grammar <- Spacing Definition+ EndOfFile
Definition <- Identifier LEFTARROW Expression
Expression <- Sequence ( SLASH Sequence )*
Sequence <- Prefix*
Prefix <- AND Action
/ ( AND | NOT )? Suffix
Suffix <- Primary ( QUERY / STAR / PLUS )?
Primary <- Identifier !LEFTARROW
/ OPEN Expression CLOSE
/ Literal
/ Class
/ DOT
/ Action
/ BEGIN
/ END
Identifier <- < IdentStart IdentCont* > Spacing
IdentStart <- [a-zA-Z_]
IdentCont <- IdentStart / [0-9]
Literal <- ['] < ( !['] Char )* > ['] Spacing
/ ["] < ( !["] Char )* > ["] Spacing
Class <- '[' < ( !']' Range )* > ']' Spacing
Range <- Char '-' Char / Char
Char <- '\\' [abefnrtv'"\[\]\\]
/ '\\' [0-3][0-7][0-7]
/ '\\' [0-7][0-7]?
/ '\\' '-'
/ !'\\' .
LEFTARROW <- '<-' Spacing
SLASH <- '/' Spacing
AND <- '&' Spacing
NOT <- '!' Spacing
QUERY <- '?' Spacing
STAR <- '*' Spacing
PLUS <- '+' Spacing
OPEN <- '(' Spacing
CLOSE <- ')' Spacing
DOT <- '.' Spacing
Spacing <- ( Space / Comment )*
Comment <- '#' ( !EndOfLine . )* EndOfLine
Space <- ' ' / '\t' / EndOfLine
EndOfLine <- '\r\n' / '\n' / '\r'
EndOfFile <- !.
Action <- '{' < [^}]* > '}' Spacing
BEGIN <- '<' Spacing
END <- '>' Spacing
LEG GRAMMARS¶
leg is a variant of
peg that adds some features of
lex(1)
and
yacc(1). It differs from
peg in the following ways.
- %{ text... %}
- A declaration section can appear anywhere that a rule definition is
expected. The text between the delimiters '%{' and '%}' is copied
verbatim to the generated C parser code before the code that
implements the parser itself.
- name = pattern
- The 'assignment' operator replaces the left arrow operator '<-'.
- rule-name
- Hyphens can appear as letters in the names of rules. Each hyphen is
converted into an underscore in the generated C source code. A single
single hyphen '-' is a legal rule name.
- = [ \t\n\r]*
number = [0-9]+ -
name = [a-zA-Z_][a-zA_Z_0-9]* -
l-paren = '(' -
r-paren = ')' -
This example shows how ignored whitespace can be obvious when reading the
grammar and yet unobtrusive when placed liberally at the end of every rule
associated with a lexical element.
- seq-1 | seq-2
- The alternation operator is vertical bar '|' rather than forward slash
'/'. The peg rule
name <- sequence-1
/ sequence-2
/ sequence-3
is therefore written
name = sequence-1
| sequence-2
| sequence-3
;
in leg (with the final semicolon being optional, as described
next).
- exp ~ { action }
- A postfix operator ~{ action } can be placed
after any expression and will behave like a normal action (arbitrary C
code) except that it is invoked only when exp fails. It binds less
tightly than any other operator except alternation and sequencing, and is
intended to make error handling and recovery code easier to write. Note
that yytext and yyleng are not available inside these
actions, but the pointer variable yy is available to give the code
access to any user-defined members of the parser state (see
"CUSTOMISING THE PARSER" below). Note also that exp is
always a single expression; to invoke an error action for any failure
within a sequence, parentheses must be used to group the sequence into a
single expression.
rule = e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }
| ...
rule = (e1 e2 e3) ~{ error("one of e[123] has failed"); }
| ...
- pattern ;
- A semicolon punctuator can optionally terminate a pattern.
- %% text...
- A double percent '%%' terminates the rules (and declarations) section of
the grammar. All text following '%%' is copied verbatim to the
generated C parser code after the parser implementation code.
- $$ = value
- A sub-rule can return a semantic value from an action by assigning
it to the pseudo-variable '$$'. All semantic values must have the same
type (which defaults to 'int'). This type can be changed by defining
YYSTYPE in a declaration section.
- identifier:name
- The semantic value returned (by assigning to '$$') from the sub-rule
name is associated with the identifier and can be referred
to in subsequent actions.
The desk calculator example below illustrates the use of '$$' and ':'.
LEG EXAMPLE: A DESK CALCULATOR¶
The extensions in
leg described above allow useful parsers and evaluators
(including declarations, grammar rules, and supporting C functions such as
'main') to be kept within a single source file. To illustrate this we show a
simple desk calculator supporting the four common arithmetic operators and
named variables. The intermediate results of arithmetic evaluation will be
accumulated on an implicit stack by returning them as semantic values from
sub-rules.
%{
#include <stdio.h> /* printf() */
#include <stdlib.h> /* atoi() */
int vars[26];
%}
Stmt = - e:Expr EOL { printf("%d\n", e); }
| ( !EOL . )* EOL { printf("error\n"); }
Expr = i:ID ASSIGN s:Sum { $$ = vars[i] = s; }
| s:Sum { $$ = s; }
Sum = l:Product
( PLUS r:Product { l += r; }
| MINUS r:Product { l -= r; }
)* { $$ = l; }
Product = l:Value
( TIMES r:Value { l *= r; }
| DIVIDE r:Value { l /= r; }
)* { $$ = l; }
Value = i:NUMBER { $$ = atoi(yytext); }
| i:ID !ASSIGN { $$ = vars[i]; }
| OPEN i:Expr CLOSE { $$ = i; }
NUMBER = < [0-9]+ > - { $$ = atoi(yytext); }
ID = < [a-z] > - { $$ = yytext[0] - 'a'; }
ASSIGN = '=' -
PLUS = '+' -
MINUS = '-' -
TIMES = '*' -
DIVIDE = '/' -
OPEN = '(' -
CLOSE = ')' -
- = [ \t]*
EOL = '\n' | '\r\n' | '\r' | ';'
%%
int main()
{
while (yyparse())
;
return 0;
}
LEG GRAMMAR FOR LEG GRAMMARS¶
The grammar for
leg grammars is shown below. This will both illustrate
and formalise the above description.
grammar = -
( declaration | definition )+
trailer? end-of-file
declaration = '%{' < ( !'%}' . )* > RPERCENT
trailer = '%%' < .* >
definition = identifier EQUAL expression SEMICOLON?
expression = sequence ( BAR sequence )*
sequence = error+
error = prefix ( TILDE action )?
prefix = AND action
| ( AND | NOT )? suffix
suffix = primary ( QUERY | STAR | PLUS )?
primary = identifier COLON identifier !EQUAL
| identifier !EQUAL
| OPEN expression CLOSE
| literal
| class
| DOT
| action
| BEGIN
| END
identifier = < [-a-zA-Z_][-a-zA-Z_0-9]* > -
literal = ['] < ( !['] char )* > ['] -
| ["] < ( !["] char )* > ["] -
class = '[' < ( !']' range )* > ']' -
range = char '-' char | char
char = '\\' [abefnrtv'"\[\]\\]
| '\\' [0-3][0-7][0-7]
| '\\' [0-7][0-7]?
| !'\\' .
action = '{' < braces* > '}' -
braces = '{' braces* '}'
| !'}' .
EQUAL = '=' -
COLON = ':' -
SEMICOLON = ';' -
BAR = '|' -
AND = '&' -
NOT = '!' -
QUERY = '?' -
STAR = '*' -
PLUS = '+' -
OPEN = '(' -
CLOSE = ')' -
DOT = '.' -
BEGIN = '<' -
END = '>' -
TILDE = '~' -
RPERCENT = '%}' -
- = ( space | comment )*
space = ' ' | '\t' | end-of-line
comment = '#' ( !end-of-line . )* end-of-line
end-of-line = '\r\n' | '\n' | '\r'
end-of-file = !.
CUSTOMISING THE PARSER¶
The following symbols can be redefined in declaration sections to modify the
generated parser code.
- YYSTYPE
- The semantic value type. The pseudo-variable '$$' and the identifiers
'bound' to rule results with the colon operator ':' should all be
considered as being declared to have this type. The default value is
'int'.
- YYPARSE
- The name of the main entry point to the parser. The default value is
'yyparse'.
- YYPARSEFROM
- The name of an alternative entry point to the parser. This function
expects one argument: the function corresponding to the rule from which
the search for a match should begin. The default is 'yyparsefrom'. Note
that yyparse() is defined as
int yyparse() { return yyparsefrom(yy_foo); }
where 'foo' is the name of the first rule in the grammar.
- YY_INPUT(buf, result, max_size)
- This macro is invoked by the parser to obtain more input text. buf
points to an area of memory that can hold at most max_size
characters. The macro should copy input text to buf and then assign
the integer variable result to indicate the number of characters
copied. If no more input is available, the macro should assign 0 to
result. By default, the YY_INPUT macro is defined as follows.
#define YY_INPUT(buf, result, max_size) \
{ \
int yyc= getchar(); \
result= (EOF == yyc) ? 0 : (*(buf)= yyc, 1); \
}
Note that if YY_CTX_LOCAL is defined (see below) then an additional first
argument, containing the parser context, is passed to YY_INPUT.
- YY_DEBUG
- If this symbols is defined then additional code will be included in the
parser that prints vast quantities of arcane information to the standard
error while the parser is running.
- YY_BEGIN
- This macro is invoked to mark the start of input text that will be made
available in actions as 'yytext'. This corresponds to occurrences of
'<' in the grammar. These are converted into predicates that are
expected to succeed. The default definition
#define YY_BEGIN (yybegin= yypos, 1)
therefore saves the current input position and returns 1 ('true') as the
result of the predicate.
- YY_END
- This macros corresponds to '>' in the grammar. Again, it is a predicate
so the default definition saves the input position before 'succeeding'.
#define YY_END (yyend= yypos, 1)
- YY_PARSE(T)
- This macro declares the parser entry points (yyparse and yyparsefrom) to
be of type T. The default definition
#define YY_PARSE(T) T
leaves yyparse() and yyparsefrom() with global visibility. If they should
not be externally visible in other source files, this macro can be
redefined to declare them 'static'.
#define YY_PARSE(T) static T
- YY_CTX_LOCAL
- If this symbol is defined during compilation of a generated parser then
global parser state will be kept in a structure of type 'yycontext' which
can be declared as a local variable. This allows multiple instances of
parsers to coexist and to be thread-safe. The parsing function
yyparse() will be declared to expect a first argument of type
'yycontext *', an instance of the structure holding the global state for
the parser. This instance must be allocated and initialised to zero by the
client. A trivial but complete example is as follows.
#include <stdio.h>
#define YY_CTX_LOCAL
#include "the-generated-parser.peg.c"
int main()
{
yycontext ctx;
memset(&ctx, 0, sizeof(yycontext));
while (yyparse(&ctx));
return 0;
}
Note that if this symbol is undefined then the compiled parser will
statically allocate its global state and will be neither reentrant nor
thread-safe. Note also that the parser yycontext structure is initialised
automatically the first time yyparse() is called; this structure
must therefore be properly initialised to zero before the first
call to yyparse().
- YY_CTX_MEMBERS
- If YY_CTX_LOCAL is defined (see above) then the macro YY_CTX_MEMBERS can
be defined to expand to any additional member field declarations that the
client would like included in the declaration of the 'yycontext' structure
type. These additional members are otherwise ignored by the generated
parser. The instance of 'yycontext' associated with the currently-active
parser is available within actions as the pointer variable yy.
- YY_BUFFER_SIZE
- The initial size of the text buffer, in bytes. The default is 1024 and the
buffer size is doubled whenever required to meet demand during parsing. An
application that typically parses much longer strings could increase this
to avoid unnecessary buffer reallocation.
- YY_STACK_SIZE
- The initial size of the variable and action stacks. The default is 128,
which is doubled whenever required to meet demand during parsing.
Applications that have deep call stacks with many local variables, or that
perform many actions after a single successful match, could increase this
to avoid unnecessary buffer reallocation.
- YY_MALLOC(YY, SIZE)
- The memory allocator for all parser-related storage. The parameters are
the current yycontext structure and the number of bytes to allocate. The
default definition is: malloc(SIZE)
- YY_REALLOC(YY, PTR, SIZE)
- The memory reallocator for dynamically-grown storage (such as text buffers
and variable stacks). The parameters are the current yycontext structure,
the previously-allocated storage, and the number of bytes to which that
storage should be grown. The default definition is:
realloc(PTR, SIZE)
- YY_FREE(YY, PTR)
- The memory deallocator. The parameters are the current yycontext structure
and the storage to deallocate. The default definition is:
free(PTR)
- YYRELEASE
- The name of the function that releases all resources held by a yycontext
structure. The default value is 'yyrelease'.
The following variables can be referred to within actions.
- char *yybuf
- This variable points to the parser's input buffer used to store input text
that has not yet been matched.
- int yypos
- This is the offset (in yybuf) of the next character to be matched and
consumed.
- char *yytext
- The most recent matched text delimited by '<' and '>' is stored in
this variable.
- int yyleng
- This variable indicates the number of characters in 'yytext'.
- yycontext *yy
- This variable points to the instance of 'yycontext' associated with the
currently-active parser.
Programs that wish to release all the resources associated with a parser can use
the following function.
- yyrelease(yycontext*yy)
- Returns all parser-allocated storage associated with yy to the
system. The storage will be reallocated on the next call to
yyparse().
Note that the storage for the yycontext structure itself is never allocated or
reclaimed implicitly. The application must allocate these structures in
automatic storage, or use
calloc() and
free() to manage them
explicitly. The example in the following section demonstrates one approach to
resource management.
LEG EXAMPLE: EXTENDING THE PARSER'S CONTEXT¶
The
yy variable passed to actions contains the state of the parser plus
any additional fields defined by YY_CTX_MEMBERS. Theses fields can be used to
store application-specific information that is global to a particular call of
yyparse(). A trivial but complete
leg example follows in which
the yycontext structure is extended with a
count of the number of
newline characters seen in the input so far (the grammar otherwise consumes
and ignores the entire input). The caller of
yyparse() uses
count to print the number of lines of input that were read.
%{
#define YY_CTX_LOCAL 1
#define YY_CTX_MEMBERS \
int count;
%}
Char = ('\n' | '\r\n' | '\r') { yy->count++ }
| .
%%
#include <stdio.h>
#include <string.h>
int main()
{
/* create a local parser context in automatic storage */
yycontext yy;
/* the context *must* be initialised to zero before first use*/
memset(&yy, 0, sizeof(yy));
while (yyparse(&yy))
;
printf("%d newlines\n", yy.count);
/* release all resources associated with the context */
yyrelease(&yy);
return 0;
}
DIAGNOSTICS¶
peg and
leg warn about the following conditions while converting a
grammar into a parser.
- syntax error
- The input grammar was malformed in some way. The error message will
include the text about to be matched (often backed up a huge amount from
the actual location of the error) and the line number of the most recently
considered character (which is often the real location of the
problem).
- rule 'foo' used but not defined
- The grammar referred to a rule named 'foo' but no definition for it was
given. Attempting to use the generated parser will likely result in errors
from the linker due to undefined symbols associated with the missing
rule.
- rule 'foo' defined but not used
- The grammar defined a rule named 'foo' and then ignored it. The code
associated with the rule is included in the generated parser which will in
all other respects be healthy.
- possible infinite left recursion in rule 'foo'
- There exists at least one path through the grammar that leads from the
rule 'foo' back to (a recursive invocation of) the same rule without
consuming any input.
Left recursion, especially that found in standards documents, is often 'direct'
and implies trivial repetition.
# (6.7.6)
direct-abstract-declarator =
LPAREN abstract-declarator RPAREN
| direct-abstract-declarator? LBRACKET assign-expr? RBRACKET
| direct-abstract-declarator? LBRACKET STAR RBRACKET
| direct-abstract-declarator? LPAREN param-type-list? RPAREN
The recursion can easily be eliminated by converting the parts of the pattern
following the recursion into a repeatable suffix.
# (6.7.6)
direct-abstract-declarator =
direct-abstract-declarator-head?
direct-abstract-declarator-tail*
direct-abstract-declarator-head =
LPAREN abstract-declarator RPAREN
direct-abstract-declarator-tail =
LBRACKET assign-expr? RBRACKET
| LBRACKET STAR RBRACKET
| LPAREN param-type-list? RPAREN
CAVEATS¶
A parser that accepts empty input will
always succeed. Consider the
following example, not atypical of a first attempt to write a PEG-based
parser:
Program = Expression*
Expression = "whatever"
%%
int main() {
while (yyparse())
puts("success!");
return 0;
}
This program loops forever, no matter what (if any) input is provided on stdin.
Many fixes are possible, the easiest being to insist that the parser always
consumes some non-empty input. Changing the first line to
Program = Expression+
accomplishes this. If the parser is expected to consume the entire input, then
explicitly requiring the end-of-file is also highly recommended:
Program = Expression+ !.
This works because the parser will only fail to match ("!" predicate)
any character at all ("." expression) when it attempts to read
beyond the end of the input.
BUGS¶
You have to type 'man peg' to read the manual page for
leg(1).
The 'yy' and 'YY' prefixes cannot be changed.
Left recursion is detected in the input grammar but is not handled correctly in
the generated parser.
Diagnostics for errors in the input grammar are obscure and not particularly
helpful.
The operators
! and
~ should really be named the other way
around.
Several commonly-used
lex(1) features (yywrap(), yyin, etc.) are
completely absent.
The generated parser does not contain '#line' directives to direct C compiler
errors back to the grammar description when appropriate.
SEE ALSO¶
D. Val Schorre,
META II, a syntax-oriented compiler writing language,
19th ACM National Conference, 1964, pp. 41.301--41.311. Describes a
self-implementing parser generator for analytic grammars with no backtracking.
Alexander Birman,
The TMG Recognition Schema, Ph.D. dissertation,
Princeton, 1970. A mathematical treatment of the power and complexity of
recursive-descent parsing with backtracking.
Bryan Ford,
Parsing Expression Grammars: A Recognition-Based Syntactic
Foundation, ACM SIGPLAN Symposium on Principles of Programming Languages,
2004. Defines PEGs and analyses them in relation to context-free and regular
grammars. Introduces the syntax adopted in
peg.
The standard Unix utilities
lex(1) and
yacc(1) which influenced
the syntax and features of
leg.
The source code for
peg and
leg whose grammar parsers are written
using themselves.
The latest version of this software and documentation:
http://piumarta.com/software/peg
AUTHOR¶
peg,
leg and this manual page were written by Ian Piumarta
(first-name at last-name dot com) while investigating the viability of regular
and parsing-expression grammars for efficiently extracting type and signature
information from C header files.
Please send bug reports and suggestions for improvements to the author at the
above address.