This is the command leg that can be run in the OnWorks free hosting provider using one of our multiple free online workstations such as Ubuntu Online, Fedora Online, Windows online emulator or MAC OS online emulator
PROGRAM:
NAME
peg, leg - parser generators
SYNOPSIS
peg [-hvV -ooutput] [filename ...]
leg [-hvV -ooutput] [filename ...]
DESCRIPTION
peg and leg are tools for generating recursive-descent parsers: programs that perform
pattern matching on text. They process a Parsing Expression Grammar (PEG) [Ford 2004] to
produce a program that recognises legal sentences of that grammar. peg processes PEGs
written using the original syntax described by Ford; leg processes PEGs written using
slightly different syntax and conventions that are intended to make it an attractive
replacement for parsers built with lex(1) and yacc(1). Unlike lex and yacc, peg and leg
support unlimited backtracking, provide ordered choice as a means for disambiguation, and
can combine scanning (lexical analysis) and parsing (syntactic analysis) into a single
activity.
peg reads the specified filenames, or standard input if no filenames are given, for a
grammar describing the parser to generate. peg then generates a C source file that
defines a function yyparse(). This C source file can be included in, or compiled and then
linked with, a client program. Each time the client program calls yyparse() the parser
consumes input text according to the parsing rules, starting from the first rule in the
grammar. yyparse() returns non-zero if the input could be parsed according to the
grammar; it returns zero if the input could not be parsed.
The prefix 'yy' or 'YY' is prepended to all externally-visible symbols in the generated
parser. This is intended to reduce the risk of namespace pollution in client programs.
(The choice of 'yy' is historical; see lex(1) and yacc(1), for example.)
OPTIONS
peg and leg provide the following options:
-h prints a summary of available options and then exits.
-ooutput
writes the generated parser to the file output instead of the standard output.
-v writes verbose information to standard error while working.
-V writes version information to standard error then exits.
A SIMPLE EXAMPLE
The following peg input specifies a grammar with a single rule (called 'start') that is
satisfied when the input contains the string "username".
start <- "username"
(The quotation marks are not part of the matched text; they serve to indicate a literal
string to be matched.) In other words, yyparse() in the generated C source will return
non-zero only if the next eight characters read from the input spell the word "username".
If the input contains anything else, yyparse() returns zero and no input will have been
consumed. (Subsequent calls to yyparse() will also return zero, since the parser is
effectively blocked looking for the string "username".) To ensure progress we can add an
alternative clause to the 'start' rule that will match any single character if "username"
is not found.
start <- "username"
/ .
yyparse() now always returns non-zero (except at the very end of the input). To do
something useful we can add actions to the rules. These actions are performed after a
complete match is found (starting from the first rule) and are chosen according to the
'path' taken through the grammar to match the input. (Linguists would call this path a
'phrase marker'.)
start <- "username" { printf("%s\n", getlogin()); }
/ < . > { putchar(yytext[0]); }
The first line instructs the parser to print the user's login name whenever it sees
"username" in the input. If that match fails, the second line tells the parser to echo
the next character on the input the standard output. Our parser is now performing useful
work: it will copy the input to the output, replacing all occurrences of "username" with
the user's account name.
Note the angle brackets ('<' and '>') that were added to the second alternative. These
have no effect on the meaning of the rule, but serve to delimit the text made available to
the following action in the variable yytext.
If the above grammar is placed in the file username.peg, running the command
peg -o username.c username.peg
will save the corresponding parser in the file username.c. To create a complete program
this parser could be included by a C program as follows.
#include <stdio.h> /* printf(), putchar() */
#include <unistd.h> /* getlogin() */
#include "username.c" /* yyparse() */
int main()
{
while (yyparse()) /* repeat until EOF */
;
return 0;
}
PEG GRAMMARS
A grammar consists of a set of named rules.
name <- pattern
The pattern contains one or more of the following elements.
name The element stands for the entire pattern in the rule with the given name.
"characters"
A character or string enclosed in double quotes is matched literally. The ANSI C
escape sequences are recognised within the characters.
'characters'
A character or string enclosed in single quotes is matched literally, as above.
[characters]
A set of characters enclosed in square brackets matches any single character from
the set, with escape characters recognised as above. If the set begins with an
uparrow (^) then the set is negated (the element matches any character not in the
set). Any pair of characters separated with a dash (-) represents the range of
characters from the first to the second, inclusive. A single alphabetic character
or underscore is matched by the following set.
[a-zA-Z_]
Similarly, the following matches any single non-digit character.
[^0-9]
. A dot matches any character. Note that the only time this fails is at the end of
file, where there is no character to match.
( pattern )
Parentheses are used for grouping (modifying the precedence of the operators
described below).
{ action }
Curly braces surround actions. The action is arbitrary C source code to be
executed at the end of matching. Any braces within the action must be properly
nested. Any input text that was matched before the action and delimited by angle
brackets (see below) is made available within the action as the contents of the
character array yytext. The length of (number of characters in) yytext is
available in the variable yyleng. (These variable names are historical; see
lex(1).)
< An opening angle bracket always matches (consuming no input) and causes the parser
to begin accumulating matched text. This text will be made available to actions in
the variable yytext.
> A closing angle bracket always matches (consuming no input) and causes the parser
to stop accumulating text for yytext.
The above elements can be made optional and/or repeatable with the following suffixes:
element ?
The element is optional. If present on the input, it is consumed and the match
succeeds. If not present on the input, no text is consumed and the match succeeds
anyway.
element +
The element is repeatable. If present on the input, one or more occurrences of
element are consumed and the match succeeds. If no occurrences of element are
present on the input, the match fails.
element *
The element is optional and repeatable. If present on the input, one or more
occurrences of element are consumed and the match succeeds. If no occurrences of
element are present on the input, the match succeeds anyway.
The above elements and suffixes can be converted into predicates (that match arbitrary
input text and subsequently succeed or fail without consuming that input) with the
following prefixes:
& element
The predicate succeeds only if element can be matched. Input text scanned while
matching element is not consumed from the input and remains available for
subsequent matching.
! element
The predicate succeeds only if element cannot be matched. Input text scanned while
matching element is not consumed from the input and remains available for
subsequent matching. A popular idiom is
!.
which matches the end of file, after the last character of the input has already
been consumed.
A special form of the '&' predicate is provided:
&{ expression }
In this predicate the simple C expression (not statement) is evaluated immediately
when the parser reaches the predicate. If the expression yields non-zero (true)
the 'match' succeeds and the parser continues with the next element in the pattern.
If the expression yields zero (false) the 'match' fails and the parser backs up to
look for an alternative parse of the input.
Several elements (with or without prefixes and suffixes) can be combined into a sequence
by writing them one after the other. The entire sequence matches only if each individual
element within it matches, from left to right.
Sequences can be separated into disjoint alternatives by the alternation operator '/'.
sequence-1 / sequence-2 / ... / sequence-N
Each sequence is tried in turn until one of them matches, at which time matching
for the overall pattern succeeds. If none of the sequences matches then the match
of the overall pattern fails.
Finally, the pound sign (#) introduces a comment (discarded) that continues until the end
of the line.
To summarise the above, the parser tries to match the input text against a pattern
containing literals, names (representing other rules), and various operators (written as
prefixes, suffixes, juxtaposition for sequencing and and infix alternation operator) that
modify how the elements within the pattern are matched. Matches are made from left to
right, 'descending' into named sub-rules as they are encountered. If the matching process
fails, the parser 'back tracks' ('rewinding' the input appropriately in the process) to
find the nearest alternative 'path' through the grammar. In other words the parser
performs a depth-first, left-to-right search for the first successfully-matching path
through the rules. If found, the actions along the successful path are executed (in the
order they were encountered).
Note that predicates are evaluated immediately during the search for a successful match,
since they contribute to the success or failure of the search. Actions, however, are
evaluated only after a successful match has been found.
PEG GRAMMAR FOR PEG GRAMMARS
The grammar for peg grammars is shown below. This will both illustrate and formalise the
above description.
Grammar <- Spacing Definition+ EndOfFile
Definition <- Identifier LEFTARROW Expression
Expression <- Sequence ( SLASH Sequence )*
Sequence <- Prefix*
Prefix <- AND Action
/ ( AND | NOT )? Suffix
Suffix <- Primary ( QUERY / STAR / PLUS )?
Primary <- Identifier !LEFTARROW
/ OPEN Expression CLOSE
/ Literal
/ Class
/ DOT
/ Action
/ BEGIN
/ END
Identifier <- < IdentStart IdentCont* > Spacing
IdentStart <- [a-zA-Z_]
IdentCont <- IdentStart / [0-9]
Literal <- ['] < ( !['] Char )* > ['] Spacing
/ ["] < ( !["] Char )* > ["] Spacing
Class <- '[' < ( !']' Range )* > ']' Spacing
Range <- Char '-' Char / Char
Char <- '\\' [abefnrtv'"\[\]\\]
/ '\\' [0-3][0-7][0-7]
/ '\\' [0-7][0-7]?
/ '\\' '-'
/ !'\\' .
LEFTARROW <- '<-' Spacing
SLASH <- '/' Spacing
AND <- '&' Spacing
NOT <- '!' Spacing
QUERY <- '?' Spacing
STAR <- '*' Spacing
PLUS <- '+' Spacing
OPEN <- '(' Spacing
CLOSE <- ')' Spacing
DOT <- '.' Spacing
Spacing <- ( Space / Comment )*
Comment <- '#' ( !EndOfLine . )* EndOfLine
Space <- ' ' / '\t' / EndOfLine
EndOfLine <- '\r\n' / '\n' / '\r'
EndOfFile <- !.
Action <- '{' < [^}]* > '}' Spacing
BEGIN <- '<' Spacing
END <- '>' Spacing
LEG GRAMMARS
leg is a variant of peg that adds some features of lex(1) and yacc(1). It differs from
peg in the following ways.
%{ text... %}
A declaration section can appear anywhere that a rule definition is expected. The
text between the delimiters '%{' and '%}' is copied verbatim to the generated C
parser code before the code that implements the parser itself.
name = pattern
The 'assignment' operator replaces the left arrow operator '<-'.
rule-name
Hyphens can appear as letters in the names of rules. Each hyphen is converted into
an underscore in the generated C source code. A single single hyphen '-' is a
legal rule name.
- = [ \t\n\r]*
number = [0-9]+ -
name = [a-zA-Z_][a-zA_Z_0-9]* -
l-paren = '(' -
r-paren = ')' -
This example shows how ignored whitespace can be obvious when reading the grammar
and yet unobtrusive when placed liberally at the end of every rule associated with
a lexical element.
seq-1 | seq-2
The alternation operator is vertical bar '|' rather than forward slash '/'. The
peg rule
name <- sequence-1
/ sequence-2
/ sequence-3
is therefore written
name = sequence-1
| sequence-2
| sequence-3
;
in leg (with the final semicolon being optional, as described next).
exp ~ { action }
A postfix operator ~{ action } can be placed after any expression and will behave
like a normal action (arbitrary C code) except that it is invoked only when exp
fails. It binds less tightly than any other operator except alternation and
sequencing, and is intended to make error handling and recovery code easier to
write. Note that yytext and yyleng are not available inside these actions, but the
pointer variable yy is available to give the code access to any user-defined
members of the parser state (see "CUSTOMISING THE PARSER" below). Note also that
exp is always a single expression; to invoke an error action for any failure within
a sequence, parentheses must be used to group the sequence into a single
expression.
rule = e1 e2 e3 ~{ error("e[12] ok; e3 has failed"); }
| ...
rule = (e1 e2 e3) ~{ error("one of e[123] has failed"); }
| ...
pattern ;
A semicolon punctuator can optionally terminate a pattern.
%% text...
A double percent '%%' terminates the rules (and declarations) section of the
grammar. All text following '%%' is copied verbatim to the generated C parser code
after the parser implementation code.
$$ = value
A sub-rule can return a semantic value from an action by assigning it to the
pseudo-variable '$$'. All semantic values must have the same type (which defaults
to 'int'). This type can be changed by defining YYSTYPE in a declaration section.
identifier:name
The semantic value returned (by assigning to '$$') from the sub-rule name is
associated with the identifier and can be referred to in subsequent actions.
The desk calculator example below illustrates the use of '$$' and ':'.
LEG EXAMPLE: A DESK CALCULATOR
The extensions in leg described above allow useful parsers and evaluators (including
declarations, grammar rules, and supporting C functions such as 'main') to be kept within
a single source file. To illustrate this we show a simple desk calculator supporting the
four common arithmetic operators and named variables. The intermediate results of
arithmetic evaluation will be accumulated on an implicit stack by returning them as
semantic values from sub-rules.
%{
#include <stdio.h> /* printf() */
#include <stdlib.h> /* atoi() */
int vars[26];
%}
Stmt = - e:Expr EOL { printf("%d\n", e); }
| ( !EOL . )* EOL { printf("error\n"); }
Expr = i:ID ASSIGN s:Sum { $$ = vars[i] = s; }
| s:Sum { $$ = s; }
Sum = l:Product
( PLUS r:Product { l += r; }
| MINUS r:Product { l -= r; }
)* { $$ = l; }
Product = l:Value
( TIMES r:Value { l *= r; }
| DIVIDE r:Value { l /= r; }
)* { $$ = l; }
Value = i:NUMBER { $$ = atoi(yytext); }
| i:ID !ASSIGN { $$ = vars[i]; }
| OPEN i:Expr CLOSE { $$ = i; }
NUMBER = < [0-9]+ > - { $$ = atoi(yytext); }
ID = < [a-z] > - { $$ = yytext[0] - 'a'; }
ASSIGN = '=' -
PLUS = '+' -
MINUS = '-' -
TIMES = '*' -
DIVIDE = '/' -
OPEN = '(' -
CLOSE = ')' -
- = [ \t]*
EOL = '\n' | '\r\n' | '\r' | ';'
%%
int main()
{
while (yyparse())
;
return 0;
}
LEG GRAMMAR FOR LEG GRAMMARS
The grammar for leg grammars is shown below. This will both illustrate and formalise the
above description.
grammar = -
( declaration | definition )+
trailer? end-of-file
declaration = '%{' < ( !'%}' . )* > RPERCENT
trailer = '%%' < .* >
definition = identifier EQUAL expression SEMICOLON?
expression = sequence ( BAR sequence )*
sequence = error+
error = prefix ( TILDE action )?
prefix = AND action
| ( AND | NOT )? suffix
suffix = primary ( QUERY | STAR | PLUS )?
primary = identifier COLON identifier !EQUAL
| identifier !EQUAL
| OPEN expression CLOSE
| literal
| class
| DOT
| action
| BEGIN
| END
identifier = < [-a-zA-Z_][-a-zA-Z_0-9]* > -
literal = ['] < ( !['] char )* > ['] -
| ["] < ( !["] char )* > ["] -
class = '[' < ( !']' range )* > ']' -
range = char '-' char | char
char = '\\' [abefnrtv'"\[\]\\]
| '\\' [0-3][0-7][0-7]
| '\\' [0-7][0-7]?
| !'\\' .
action = '{' < braces* > '}' -
braces = '{' braces* '}'
| !'}' .
EQUAL = '=' -
COLON = ':' -
SEMICOLON = ';' -
BAR = '|' -
AND = '&' -
NOT = '!' -
QUERY = '?' -
STAR = '*' -
PLUS = '+' -
OPEN = '(' -
CLOSE = ')' -
DOT = '.' -
BEGIN = '<' -
END = '>' -
TILDE = '~' -
RPERCENT = '%}' -
- = ( space | comment )*
space = ' ' | '\t' | end-of-line
comment = '#' ( !end-of-line . )* end-of-line
end-of-line = '\r\n' | '\n' | '\r'
end-of-file = !.
CUSTOMISING THE PARSER
The following symbols can be redefined in declaration sections to modify the generated
parser code.
YYSTYPE
The semantic value type. The pseudo-variable '$$' and the identifiers 'bound' to
rule results with the colon operator ':' should all be considered as being declared
to have this type. The default value is 'int'.
YYPARSE
The name of the main entry point to the parser. The default value is 'yyparse'.
YYPARSEFROM
The name of an alternative entry point to the parser. This function expects one
argument: the function corresponding to the rule from which the search for a match
should begin. The default is 'yyparsefrom'. Note that yyparse() is defined as
int yyparse() { return yyparsefrom(yy_foo); }
where 'foo' is the name of the first rule in the grammar.
YY_INPUT(buf, result, max_size)
This macro is invoked by the parser to obtain more input text. buf points to an
area of memory that can hold at most max_size characters. The macro should copy
input text to buf and then assign the integer variable result to indicate the
number of characters copied. If no more input is available, the macro should
assign 0 to result. By default, the YY_INPUT macro is defined as follows.
#define YY_INPUT(buf, result, max_size)
{
int yyc= getchar();
result= (EOF == yyc) ? 0 : (*(buf)= yyc, 1);
}
Note that if YY_CTX_LOCAL is defined (see below) then an additional first argument,
containing the parser context, is passed to YY_INPUT.
YY_DEBUG
If this symbols is defined then additional code will be included in the parser that
prints vast quantities of arcane information to the standard error while the parser
is running.
YY_BEGIN
This macro is invoked to mark the start of input text that will be made available
in actions as 'yytext'. This corresponds to occurrences of '<' in the grammar.
These are converted into predicates that are expected to succeed. The default
definition
#define YY_BEGIN (yybegin= yypos, 1)
therefore saves the current input position and returns 1 ('true') as the result of
the predicate.
YY_END This macros corresponds to '>' in the grammar. Again, it is a predicate so the
default definition saves the input position before 'succeeding'.
#define YY_END (yyend= yypos, 1)
YY_PARSE(T)
This macro declares the parser entry points (yyparse and yyparsefrom) to be of type
T. The default definition
#define YY_PARSE(T) T
leaves yyparse() and yyparsefrom() with global visibility. If they should not be
externally visible in other source files, this macro can be redefined to declare
them 'static'.
#define YY_PARSE(T) static T
YY_CTX_LOCAL
If this symbol is defined during compilation of a generated parser then global
parser state will be kept in a structure of type 'yycontext' which can be declared
as a local variable. This allows multiple instances of parsers to coexist and to
be thread-safe. The parsing function yyparse() will be declared to expect a first
argument of type 'yycontext *', an instance of the structure holding the global
state for the parser. This instance must be allocated and initialised to zero by
the client. A trivial but complete example is as follows.
#include <stdio.h>
#define YY_CTX_LOCAL
#include "the-generated-parser.peg.c"
int main()
{
yycontext ctx;
memset(&ctx, 0, sizeof(yycontext));
while (yyparse(&ctx));
return 0;
}
Note that if this symbol is undefined then the compiled parser will statically
allocate its global state and will be neither reentrant nor thread-safe. Note also
that the parser yycontext structure is initialised automatically the first time
yyparse() is called; this structure must therefore be properly initialised to zero
before the first call to yyparse().
YY_CTX_MEMBERS
If YY_CTX_LOCAL is defined (see above) then the macro YY_CTX_MEMBERS can be defined
to expand to any additional member field declarations that the client would like
included in the declaration of the 'yycontext' structure type. These additional
members are otherwise ignored by the generated parser. The instance of 'yycontext'
associated with the currently-active parser is available within actions as the
pointer variable yy.
YY_BUFFER_SIZE
The initial size of the text buffer, in bytes. The default is 1024 and the buffer
size is doubled whenever required to meet demand during parsing. An application
that typically parses much longer strings could increase this to avoid unnecessary
buffer reallocation.
YY_STACK_SIZE
The initial size of the variable and action stacks. The default is 128, which is
doubled whenever required to meet demand during parsing. Applications that have
deep call stacks with many local variables, or that perform many actions after a
single successful match, could increase this to avoid unnecessary buffer
reallocation.
YY_MALLOC(YY, SIZE)
The memory allocator for all parser-related storage. The parameters are the
current yycontext structure and the number of bytes to allocate. The default
definition is: malloc(SIZE)
YY_REALLOC(YY, PTR, SIZE)
The memory reallocator for dynamically-grown storage (such as text buffers and
variable stacks). The parameters are the current yycontext structure, the
previously-allocated storage, and the number of bytes to which that storage should
be grown. The default definition is: realloc(PTR, SIZE)
YY_FREE(YY, PTR)
The memory deallocator. The parameters are the current yycontext structure and the
storage to deallocate. The default definition is: free(PTR)
YYRELEASE
The name of the function that releases all resources held by a yycontext structure.
The default value is 'yyrelease'.
The following variables can be referred to within actions.
char *yybuf
This variable points to the parser's input buffer used to store input text that has
not yet been matched.
int yypos
This is the offset (in yybuf) of the next character to be matched and consumed.
char *yytext
The most recent matched text delimited by '<' and '>' is stored in this variable.
int yyleng
This variable indicates the number of characters in 'yytext'.
yycontext *yy
This variable points to the instance of 'yycontext' associated with the
currently-active parser.
Programs that wish to release all the resources associated with a parser can use the
following function.
yyrelease(yycontext*yy)
Returns all parser-allocated storage associated with yy to the system. The storage
will be reallocated on the next call to yyparse().
Note that the storage for the yycontext structure itself is never allocated or reclaimed
implicitly. The application must allocate these structures in automatic storage, or use
calloc() and free() to manage them explicitly. The example in the following section
demonstrates one approach to resource management.
LEG EXAMPLE: EXTENDING THE PARSER'S CONTEXT
The yy variable passed to actions contains the state of the parser plus any additional
fields defined by YY_CTX_MEMBERS. Theses fields can be used to store application-specific
information that is global to a particular call of yyparse(). A trivial but complete leg
example follows in which the yycontext structure is extended with a count of the number of
newline characters seen in the input so far (the grammar otherwise consumes and ignores
the entire input). The caller of yyparse() uses count to print the number of lines of
input that were read.
%{
#define YY_CTX_LOCAL 1
#define YY_CTX_MEMBERS
int count;
%}
Char = ('\n' | '\r\n' | '\r') { yy->count++ }
| .
%%
#include <stdio.h>
#include <string.h>
int main()
{
/* create a local parser context in automatic storage */
yycontext yy;
/* the context *must* be initialised to zero before first use*/
memset(&yy, 0, sizeof(yy));
while (yyparse(&yy))
;
printf("%d newlines\n", yy.count);
/* release all resources associated with the context */
yyrelease(&yy);
return 0;
}
DIAGNOSTICS
peg and leg warn about the following conditions while converting a grammar into a parser.
syntax error
The input grammar was malformed in some way. The error message will include the
text about to be matched (often backed up a huge amount from the actual location of
the error) and the line number of the most recently considered character (which is
often the real location of the problem).
rule 'foo' used but not defined
The grammar referred to a rule named 'foo' but no definition for it was given.
Attempting to use the generated parser will likely result in errors from the linker
due to undefined symbols associated with the missing rule.
rule 'foo' defined but not used
The grammar defined a rule named 'foo' and then ignored it. The code associated
with the rule is included in the generated parser which will in all other respects
be healthy.
possible infinite left recursion in rule 'foo'
There exists at least one path through the grammar that leads from the rule 'foo'
back to (a recursive invocation of) the same rule without consuming any input.
Left recursion, especially that found in standards documents, is often 'direct' and
implies trivial repetition.
# (6.7.6)
direct-abstract-declarator =
LPAREN abstract-declarator RPAREN
| direct-abstract-declarator? LBRACKET assign-expr? RBRACKET
| direct-abstract-declarator? LBRACKET STAR RBRACKET
| direct-abstract-declarator? LPAREN param-type-list? RPAREN
The recursion can easily be eliminated by converting the parts of the pattern following
the recursion into a repeatable suffix.
# (6.7.6)
direct-abstract-declarator =
direct-abstract-declarator-head?
direct-abstract-declarator-tail*
direct-abstract-declarator-head =
LPAREN abstract-declarator RPAREN
direct-abstract-declarator-tail =
LBRACKET assign-expr? RBRACKET
| LBRACKET STAR RBRACKET
| LPAREN param-type-list? RPAREN
CAVEATS
A parser that accepts empty input will always succeed. Consider the following example,
not atypical of a first attempt to write a PEG-based parser:
Program = Expression*
Expression = "whatever"
%%
int main() {
while (yyparse())
puts("success!");
return 0;
}
This program loops forever, no matter what (if any) input is provided on stdin. Many
fixes are possible, the easiest being to insist that the parser always consumes some
non-empty input. Changing the first line to
Program = Expression+
accomplishes this. If the parser is expected to consume the entire input, then explicitly
requiring the end-of-file is also highly recommended:
Program = Expression+ !.
This works because the parser will only fail to match ("!" predicate) any character at all
("." expression) when it attempts to read beyond the end of the input.
Use leg online using onworks.net services