NAME¶
Marpa::R2 - Release 2 of Marpa
Synopsis¶
use Marpa::R2;
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Expression ::= Term action => ::first
Term ::=
Factor action => ::first
| Term '+' Term action => do_add
Factor ::=
Number action => ::first
| Factor '*' Factor action => do_multiply
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL
my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce = Marpa::R2::Scanless::R->new(
{ grammar => $grammar, semantics_package => 'My_Actions' } );
my $input = '42 * 1 + 7';
$recce->read( \$input );
my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';
sub My_Actions::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
sub My_Actions::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
Description¶
Overview¶
Marpa parses any language whose grammar can be written in BNF. That includes
recursive grammars, ambiguous grammars, infinitely ambiguous grammars and
grammars with useless or empty productions. Marpa does both left- and
right-recursion in linear time -- in fact if a grammar is in any class
currently in practical use, Marpa will parse it in linear time.
This document centers around a short tutorial of the Scanless interface (SLIF).
This is the interface most suitable for beginners. The SLIF is the most
suitable interface for most advanced uses as well.
A simple calculator¶
The synopsis shows the code for an extremely simple calculator. It handles only
addition and multiplication of integers. The sections which follow explain,
line by line, how it works. The explanation will assume that the reader
understands BNF and the basics of grammars -- what rules are, what symbols
are, what the start symbol of a grammar is, etc.
Marpa::R2::Scanless::G::new¶
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Expression ::= Term action => ::first
Term ::=
Factor action => ::first
| Term '+' Term action => do_add
Factor ::=
Number action => ::first
| Factor '*' Factor action => do_multiply
Number ~ digits
digits ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL
my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
The code first creates a new SLIF grammar. SLIF grammars are
"Marpa::R2::Scanless:G" objects. They are created with the
Marpa::R2::Scanless:G::new constructor. The arguments to
Marpa::R2::Scanless::G::new are references to hashes of named arguments. In
the key/value pairs of these hashes, the hash key is the name of the argument,
and the hash value is the value of the named argument.
In the example, there is only one named argument to the SLIF grammar
constructor: "source". The value of "source" must be a
reference to a string in the SLIF's domain-specific language (DSL). In this
example, the DSL consists of several rules and pseudo-rules.
The default pseudo-rule¶
:default ::= action => [name,values]
lexeme default = latm => 1
These two lines set useful defaults. The first sets a default semantics, one
which is especially useful for development. This is a finished script, so the
default semantics is not used much. We'll talk about this more when we discuss
semantics at the end.
The second line sets the longest acceptable tokens match (LATM) style of lexing,
which is what you'll almost always want. It is not the default for historical
reasons, so your scripts will almost always start with this line.
A G1 rule¶
Next follows a G1, or structural rule. Structural rules are the kinds of rules
typically seen in BNF -- they describe the symbols which provide the structure
of the grammar, but leave out details of whitespace. The SLIF also handles the
lexical details in this example, but it does it via L0 rules, which we will
see shortly.
Expression ::= Term action => ::first
As is normal for BNF rules, this rule consists of a left hand side symbol
(""Expression""), the BNF operator
(""::="") and a series of right hand side (RHS) symbols.
There is always exactly one left hand side (LHS) symbol. There may be any
number of RHS symbols. In the case of an empty rule, the number of RHS symbols
would be zero. In this rule, there is one RHS symbol,
""Term"".
The BNF operator (""::="") is what makes this rule a G1
(structural) rule. Later we will see lexical rules, which will use the match
operator (""~"").
After the rule is an adverb: "action => ::first". We'll explain the
purpose of the "action" adverbs when we discuss semantics
More complicated G1 rules¶
Term ::=
Factor action => ::first
| Term '+' Term action => do_add
This rule says that a "Term" may be one of two alternatives: either a
"Factor" or two "Term"'s separated by an addition
operator. Immediately following is another G1 rule defining a
"Factor". It is very similar in form to the one for
"Term".
Factor ::=
Number action => ::first
| Factor '*' Factor action => do_multiply
L0 rules¶
The structural rules define the high-level structure of the grammar, and ignore
details of whitespace, comments, etc. Now we look at how the low-level,
lexical issues are handled. This very simple calculator language does not
allow comments, but it does define whitespace.
:discard ~ whitespace
whitespace ~ [\s]+
The ":discard" rule is a pseudo-rule, which tells Marpa to use
whatever it matches to separate G1 symbols, but otherwise to ignore it -- to
"discard" it. "whitespace" is defined in the next rule as
a sequence of one or more spaces.
Note the match operator (""~"") in the rule defining
whitespace. It tells Marpa that this rule is lexical and should be interpreted
exactly as written, character by character.
The "whitespace" rule is a special kind of rule in two respects.
First, its RHS is followed by a quantifier (""+""), which
makes it a sequence rule. Aside from the quantifier, sequence rules may only
have a single symbol or character class on their RHS. The plus quantifier
(""+"") means a sequence of one or more items. The star
quantifier (""*"") is also allowed, and it indicates a
sequence of zero or more items.
The whitespace items are defined by a character class: "[\s]". Marpa
supports the same character classes, and the same character class syntax, as
Perl does.
The next pair of L0 rules define the "Number" symbol
Number ~ digits
digits ~ [\d]+
The above two rules say that a "Number" is a sequence of one or more
digits. "Number" is a lexeme -- a G1 symbol which is defined and
recognized at the lexical (L0) level. In this example, there are three other
lexemes: "whitespace", and the addition and multiplication
operators.
We've already looked at the "whitespace" lexeme, which will be
discarded without being seen by G1. The addition and multiplication operators
were defined with single quoted strings in the G1 rules. As a reminder, here's
the rule for "Term" again:
Term ::=
Factor action => ::first
| Term '+' Term action => do_add
In the above rule, the single-quoted string '+' implicitly defines a L0 lexeme.
Something similar happens with the '*' string in the rule for a
"Factor".
The SLIF's lexer mostly "does what you mean". While the input is being
read, it looks for all lexemes defined in the DSL. Almost always, you'll want
Marpa to look only for tokens that are actually acceptable to the parse.
Telling Marpa to do so was the purpose of this line:
lexeme default = latm => 1
LATM means "longest acceptable tokens match". (LATM is not the default
for historical reasons.)
Among the acceptable tokens, Marpa looks for longest matches -- if multiple
tokens match, the longest match is the winner. Marpa tolerates ambiguity, so
one feature special to Marpa is that LATM is a longest acceptable
tokens match -- if more than one token is longest, all of them are
considered in the parse. The logic of SLIF lexing is described with more
precision in the SLIF overview document.
Marpa::R2::Scanless::R::new¶
my $recce = Marpa::R2::Scanless::R->new(
{ grammar => $grammar, semantics_package => 'My_Actions' } );
"Marpa::R2::Scanless::R::new" creates a new SLIF recognizer. Its
arguments are references to hashes of named arguments. In this example the
first named argument is the required argument:
""grammar"". The value of the "grammar" named
argument must be a Marpa::R2 SLIF grammar.
The second argument is optional, but you will use it frequently. The
""semantics_package"" named argument tells Marpa in which
Perl package to look for the closures implementing the semantics for this
grammar. We will talk more about this below.
Marpa::R2::Scanless::R::read¶
my $input = '42 * 1 + 7';
$recce->read( \$input );
To parse a string, we use the "Marpa::R2::Scanless::R::read()" method.
In its simplest form, as here, the "Marpa::R2::Scanless::R::read()"
takes a reference to a string containing the input stream as its argument.
Marpa::R2::Scanless::R::value¶
my $value_ref = $recce->value;
my $value = $value_ref ? ${$value_ref} : 'No Parse';
The "Marpa::R2::Scanless::R::value()" method returns a reference to
the parse result's value, if there was a parse result. If there was no parse
result, "Marpa::R2::Scanless::R::value()" returns "undef".
We have yet to describe how the Marpa SLIF computes the value of a parse. In
fact, up to this point, we have been skipping everything that had to do with
semantics. Now it is time to go back to those features.
Semantics¶
The value of the parse result, as returned via the "value()" method,
is determined by the parse's
semantics. Marpa's semantics are the
traditional ones: The input is seen as a tree which takes its structure from
the G1 rules. (This is why the G1 rules are called structural.) The value of
the parse results from repeatedly evaluating nodes of this tree, starting at
the bottom, with the results of child nodes made available to their parent
node when the parent node is evaluated.
Parse trees are usually drawn upside-down with their root at the top, and their
"leaves" at the bottom. In Marpa::R2's SLIF, the "leaves"
are the symbols that the G1 (structural) rules share with the L0 (lexical)
rules. The symbols shared by L0 and G1 are those lexemes which are not
discarded. In this example, the lexemes visible to G1 are "Number"
and two operators which are specified with a quoted string:
""+"" and ""*"".
Marpa assigns values to the nodes of the tree, starting with the leaves. Marpa's
"leaves" will always be L0 symbols, and their value by default is
the literal value at their location in the input stream. In the case of the
two operators described by quoted string, the value is that quoted string.
That is, the value of '"+"' is '"+"', and the value of
'"*"' is '"*"'. The value of "Number" will be
the portion of the input that matched the "[\d]+" pattern.
Starting with the values for leaves, Marpa::R2 moves recursively "up"
the tree to its root, assigning a value to each node of the tree based on the
value of its child nodes. Each non-leaf node corresponds to a G1 rule, and the
children of the non-leaf node correspond to the RHS symbols of the rule. When
the non-leaf node is valued, its value becomes the value of its LHS symbol,
and this value will become the value of a RHS symbol of another node with one
exception.
The one exception, the node with a LHS symbol that does not become a RHS symbol,
is the value of the top (or "root") node. The value of the top node
becomes the value of the parse, and this is the parse result value to which
the "value()" method returns a reference.
:default ::= action => [name,values]
Each non-leaf node determines its value with an action. The default pseudo-rule
allows you to specify the default action. (It is a pseudo-rule because its
LHS, "":default"", is a pseudo-symbol, not a real one.)
Often actions are Perl functions, which in this context are called Perl
semantic closures.
my $recce = Marpa::R2::Scanless::R->new(
{ grammar => $grammar, semantics_package => 'My_Actions' } );
Above we saw the "semantics_package" named argument used when
constructing the SLIF recognizer. As we noted, this specifies the package that
is used to find the Perl semantic closures.
In this example the default semantics, as specified by the
"default_action" named argument, come from a "array
descriptor" named ""[name,values]"". This indicates
that, by default, the value of a rule is to be a reference to an array
consisting of the rule's name, followed by the values of its children.
In this case, the semantics is not actually used, and you would usually change
it to something more convenient for your application. But
""[name,values]"" is an excellent starting point when
you're first developing a DSL and, since this code is intended as a template,
we've kept it. For more about array descriptors, see the semantics document
The other way we specify semantics in this example is by using an
"action" adverb for a RHS alternative. We've seen the
"action" adverb several times, but skipped over it. Now it is time
to look at it.
Term ::=
Factor action => ::first
| Term '+' Term action => do_add
Factor ::=
Number action => ::first
| Factor '*' Factor action => do_multiply
The ""::first"" action indicates that the value of a rule is
to be the value of its first child, that is, the value corresponding to the
first symbol of the rule's RHS. (In the case of an empty rule, the value would
be a Perl "undef"). (The initial double colon indicates a reserved
action.)
The action for the second RHS alternative defining "Term" is
"do_add", and the action for the second RHS alternative defining
"Factor" is "do_multiply". To implement these actions, we
need to "resolve" their names -- map the action names into the Perl
closures which actually carry out the semantics.
The "semantics_package" specified the package where we can find the
actions: ""My_Actions"". So, to resolve the
"do_multiply" action, Marpa looks for a closure whose fully
qualified name is "My_Actions::do_multiply", which it finds:
sub My_Actions::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
The "do_add" action is resolved to a Perl semantic closure in much the
same way:
sub My_Actions::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
The Perl semantic closures are callbacks. They are called as each node in a
parse tree is evaluated.
Each Perl semantic closure is called with one or more arguments. The first
argument to a value action is always a per-parse-tree object, which the
callbacks can use as a scratchpad. In this example, the per-parse-tree object
is not used. The remaining arguments will be the values of the node's
"children" -- in other words, the values computed for each of its
RHS symbols, in order. If the action is for an empty rule, the per-parse-tree
object will be its only argument.
Every value action is expected to return a value. With one exception, this value
is passed up to a parent node as an argument. The exception is the value for
the start rule. The return value for the start rule becomes the parse result.
Tainted data¶
Marpa::R2 exists to allow its input to alter execution in flexible and powerful
ways. Marpa should not be used with untrusted input. In Perl' s taint mode, it
is a fatal error to use Marpa's SLIF interface with a tainted grammar, a
tainted input string, or tainted token values.
Threads¶
When used in a thread-safe Perl, Marpa::R2 should be thread-safe, with one
important restriction: All Marpa objects that share the same grammar must be
created and used within a single thread.
This restriction may be lifted someday, but in practice it does not seem
onerous. Note that you can use the same grammar in different threads by
creating grammars that are exact copies of each other, one grammar per thread.
The Marpa:: namespace¶
The "Marpa::" top-level namespace is reserved. For extensions to
Marpa, one appropriate place is the "MarpaX::" namespace. This
practice helps avoid namespace collisions, and follows a CPAN standard, as
exemplified by the "DBIx::" "LWPx::" and
"MooseX::" which are for extensions of, respectively, DBI, LWP and
Moose.
Other documents¶
This document gives a semi-tutorial overview of Marpa's Scanless interface
(SLIF). For more details about the SLIF, there is an overview, and pages
describing its DSL, its grammar methods, and its recognizer methods.
Marpa has two other interfaces. The thin interface provides direct access to the
underlying Libmarpa C library. Of the Perl interfaces to Marpa, the thin
interface is the most low-level. The thin interface offers efficient access to
the full power of the Marpa parse engine, but it requires the application to
do a lot of the work itself.
Now discouraged, the named argument inteface (NAIF) was Marpa::R2's first
interface. It is a more traditional, middle level interface which uses Perl
calls instead of a DSL.
Marpa::R2::Vocabulary is intended as a quick refresher in parsing terminology,
emphasizing how the standard terms are used in the Marpa context. Marpa's
standard semantics are fully described in the Marpa::R2::Semantics document.
Techniques for tracing and for debugging your Marpa grammars are described in
the Marpa::R2::Tracing document and the Marpa::R2::Progress document. For
those with a theoretical bent, my sources, and other useful references, are
described in Marpa::R2::Advanced::Bibliography.
Author¶
Jeffrey Kegler
Why is it called "Marpa"?¶
Marpa is the name of the greatest of the Tibetan "translators". In his
time (the 11th century AD) Indian Buddhism was at its height. Marpa's
generation of scholars was devoted to producing Tibetan versions of Buddhism's
Sanskrit scriptures. Marpa became the greatest of them, and today is known as
Marpa Lotsawa: "Marpa the Translator".
Blatant plug¶
Marpa is a character in my novel,
The God Proof.
The God
Proof centers around Kurt Goedel's proof of God's existence. Yes,
that Kurt Goedel, and yes, he really did work out a God Proof (it's in
his
Collected Works, Vol. 3, pp. 403-404).
The God Proof is
available as a free download (<
http://www.lulu.com/content/933192>). It
can be purchased in print form at Amazon.com:
<
http://www.amazon.com/God-Proof-Jeffrey-Kegler/dp/1434807355>.
Support¶
Marpa::R2 comes without warranty. Support is provided on a volunteer basis
through the standard mechanisms for CPAN modules. The Support document has
details.
Copyright and License¶
Copyright 2014 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.