NAME¶
Marpa::R2::NAIF::Progress - Progress reports for the NAIF
About this document¶
This document describes Marpa's progress reports for its named argument
interface (NAIF). If you are a beginner, or are not sure which interface you
are interested in, or do not know what the NAIF interfaces is, you probably
are looking for the document on progress reports for the SLIF interface.
Progress reports allow an application to know exactly where it is in the parse
at any point. For parse locations of the user's choosing, progress reports
list all the rules in play, and indicate the location at which the rule
started, and how far into the rule parsing has progressed.
For those new to parsing, this "situational awareness" might seem to
be a feature that they can expect from any servicable parser. In fact,
situation awareness in most parsers is extremely limited, and it tends to be
more limited in those parsers considered production quality. Marpa is not just
unusual in the amount of feedback it offers the application, it breaks new
ground.
Progress reports are extremely useful in debugging grammars. Because debugging
creates a clear "narrative", the detailed example in this document
is a debugging situation. Readers specifically interested in debugging a
grammar should read the document on tracing problems before reading this
document.
Introduction to Earley parsing¶
To read the "show_progress" output, it is important to have a basic
idea of what Earley items are, and of what the information in them means.
Everything that the user needs to know is explained in this section.
Dotted rules¶
The idea behind Earley's algorithm is that you can parse by building a table of
rules and where you are in those rules. "Where" means two things:
location in the rule relative to the rule's symbols, and location relative to
the parse's input stream.
Let's look at an example of a rule in a context-free grammar. Here's the rule
for assignment from the Perl distribution's "perly.y"
" termbinop -> term ASSIGNOP term"
"ASSIGNOP" is "perly.y"'s internal name for the assignment
operator. In plain Perl terms, this is the ""=""
character.
In parsing this rule, we can be at any of four possible locations. One location
is at the beginning, before all of the symbols. The other three locations are
immediately after each of the rule's three symbols.
Within a rule, position relative to the symbols of the rule is traditionally
indicated with a dot. In fact, the symbol-relative rule position is very often
called the
dot location. Taken as a pair, a rule and a dot location are
called a
dotted rule.
Here's our rule with a dot location indicated:
" termbinop -> X term ASSIGNOP term"
The dot location in this dotted rule is at the beginning. A dot location at the
beginning of a dotted rule means that we have not recognized any symbols in
the rule yet. All we are doing is predicting that the rule will occur. A
dotted rule with the dot before all of its symbols is called a
prediction or a
predicted rule.
Here's another dotted rule:
" termbinop -> term X ASSIGNOP term"
In this dotted rule, we are saying we have seen a "term", but have not
yet recognized an "ASSIGNOP".
There's another special kind of dotted rule, a completion. A
completion
(also called a
completed rule) is a dotted rule with the dot after all
of the symbols. Here is the completion for the rule that we have been using as
an example:
" termbinop -> term ASSIGNOP term X"
A completion indicates that a rule has been fully recognized.
Earley items¶
The dotted rules contain all but one piece of the information that Earley's
algorithm needs to track. The missing piece is the second of the two
"wheres": where in the input stream. To associate input stream
location and dotted rules, Earley's algorithm uses what are now called Earley
items.
A convenient way to think of an
Earley item is as a triple, or 3-tuple,
consisting of dotted rule, origin and current location. The
origin is
the location in the input stream where the dotted rule starts. The
current
location (also called the
dot location) is the location in the
input stream which corresponds to the dot position.
Two noteworthy consequences follow from the way in which origin and current
location are defined. First, if a dotted rule is a prediction, then origin and
current location will always be the same. Second, the input stream location
where a rule ends is not tracked unless the dotted rule is a completion. In
other cases, an Earley item does not tell us if a rule will ever be completed,
much less at which location.
The example¶
For this example of debugging, I've taken a very common example of a grammar and
deliberately introduced a problem. (All the code and the full trace outputs
for this example are in the Appendix.) I've commented out the correct start
rule:
## { lhs => 'Expression', rhs => [qw/Term/] },
and replaced it with another start rule, one which will cause problems:
{ lhs => 'Expression', rhs => [qw/Factor/] },
In what follows, we'll pretend we don't already know where the problem is, and
use the Marpa diagnostics and tracing facilities to "discover" it.
First warning¶
Right off the bat, we get two warning messages:
Inaccessible symbol: Add
Inaccessible symbol: Term
If we were alert, these would be enough to tell us there is a serious problem.
"Inaccessible" symbols are symbols which cannot be reached from the
start symbol. This means that the grammar will never produce them, and that
parses will never find them in the input.
Since "Add" and "Term" are both important symbols in our
application, that should tell us our grammar has a serious problem. In fact,
these warning messages would often be enough to point us to the error. But, in
order to look at more of Marpa's tracing facilities, let's pretend we have not
had our morning coffee, and that we miss the significance of these warning
messages.
Output from trace_terminals()¶
Before looking at Marpa's progress reports, it is usually best to orient
yourself by looking at the output from "trace_terminals". Typically,
you will be interested in the last tokens to be accepted, and perhaps also in
the tokens the recognizer was looking for when it didn't find what it wanted.
Sometimes that information alone is enough to make it clear where the problem
is.
The full "trace_terminals" output for this example is in the Appendix.
We see that the recognizer seems to accept ""42*1"" but it
fails when it confronts the plus sign (""+""). The last
two lines are:
Accepted "Number" at 2-3
Expecting "Multiply" at 3
A note in passing: Marpa shows the location of the tokens it accepts as a range
of locations. For "Number", the range is
""2-3"", indicating that the token starts at location 2
and ends at location 3. In its default input model, all tokens have length 1,
so this is clearly information overkill. But Marpa allows other input models,
and in those models the information about start and end location of the token
is important.
Returning to the problem at hand: We notice that at location 3 we are expecting
a "Multiply" operator, but not an "Add" operator. That
should strike us as strange, and send us back to the grammar. But for the sake
of our example we will assume that we are slow on the uptake today, and that
this does not clue us in. We move on.
Output from show_progress()¶
Marpa's most powerful tool for debugging grammars is its progress report, which
shows the Earley items being worked on. In the Appendix, progress reports for
the entire parse are shown. Our example in this document is a very small one,
so that producing progress reports for the entire parse is a reasonable thing
to do in this case. If a parse is at all large, you will usually need to be
selective.
The progress report that is usually of most interest is the one for the Earley
set that you were working on when the error occurred. This is called the
current location. In our example the current location is location 3. By
default, "show_progress" prints out only the progress reports for
the current location.
Here are the progress reports for the current location, location 3, from our
example.
F0 @0-3 Expression -> Factor .
F2 @2-3 Factor -> Number .
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
F4 @0-3 Factor -> Factor Multiply Factor .
Progress report lines¶
The last field of each Progress Report line shows, in fully expanded form, the
dotted rule we were working on. Since that is the most important information,
it may be tempting to skip the rest of this section, and move directly forward
with the debugging.
In fact, you might want to do exactly that -- skip to the beginning of the next
section. What follows talks about the details of the format of the first few
fields in each progress report line. These first few fields, while helpful,
are also usually one or more of obvious in their meaning, not relevant to our
example, and repetitive of information which can be deduced from other fields.
F0 @0-3 Expression -> Factor .
Prefixed to the dotted rule are two fields: ""F0 @0-3"". The
""F0"" says that this is a completed or
final rule,
and that it is rule number 0. The rule number is used in other tracing and
debugging output, when displaying the whole rule would take too much space. In
what follows we won't need the rule number.
The ""@0-3"" describes the location of the dotted rule in
the parse. In its simplest form, the location field is two location numbers,
separated by a hyphen. The first location number is the origin, the place
where Marpa first started recognizing the rule. The last location number is
the dot location, the location location of the dot in a dotted rule.
""@0-3"" say that this rule began at location 0, and that
the dot is at location 3.
The current location is location 3, and this is no coincidence. Whenever we are
displaying the progress report for an location, all the progress report lines
will have their dot location at that location.
As an aside, notice that the left hand side symbol is "Expression".
That is the start symbol. The presence of a completed start rule in our
progress report indicates that if our input ended at location 3, it would be a
valid sentence in the language of our grammar.
Let's look at another progress report line:
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
Here the ""R4:1"" indicates that this is rule number 4 (the
""R"" stands for rule number) and that its dot position is
after the first symbol on the right hand side. Symbol positions are numbered
using the ordinal of the symbol just before the position. Symbols are numbered
starting with 1, and symbol position 1 is the position immediately after
symbol 1.
The next field (""x2"") is new. It is a count. A progress
report can contain multiple instances of the same dotted rule, and when there
is more than one, a count field is included in the progress report line. Here
the ""x2"" indicates that there are two instances of
"Factor -> Factor . Multiply Factor" at this location.
Multiple instances of a dotted rule will differ in their origin, and where they
do, this is shown in the location field of the progress report line. Here the
location field is ""@0,2-3"", which indicates that one
instance of this dotted rule has its origin at location 0, and the other has
its origin at location 2. All instances reported on a single progress report
line will always have the same dot location, and in this case it is location
3.
Predicted rules also appear in progress reports:
P2 @2-2 Factor -> . Number
Here the ""P"" in the summary field means
"predicted". As with much of the information in the summary field,
this only repeats what is obvious from the full expansion of the dotted rule
later in the line. But final (or completed) and predicted rules can be
important and the initial "F" and "P" make these lines
easy to spot.
Notice that in the predicted rule, the origin is the same as the dot location.
This will always be the case with predicted rules.
For any given location, no predicted rule has more than one instance. For other
dotted rules, there may be many instances of the dotted rule at a single
location. In grammars with right recursion, the number of instances is limited
only by the length of the recursion. The length of a recursion is limited
primarily by the available memory.
When there are many instances of a dotted rule at a single location, it is
inconvenient to show all the origins in a comma-separated list. In that case
the origins in the location field are shown as a range, with the earliest
separated from the most recent by a ""..."". The example
in this document contains no lines with a large number of instances, but here
is an example from another grammar. This is the progress report line for the
completed rule in a right recursion of length 20.
F1 x20 @0...19-20 Top_sequence -> Top Top_sequence .
OK! Now to find the bug¶
Here again are progress reports at the location where things went wrong:
F0 @0-3 Expression -> Factor .
F2 @2-3 Factor -> Number .
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
F4 @0-3 Factor -> Factor Multiply Factor .
We see that we have completed rules for "Expression", and
"Factor", as expected. We also see two Earley items that show that
we are in the process of building another "Factor", and that it is
expecting a "Multiply" symbol. This is not the rule we want, but it
explains why the "trace_terminals" output showed that the recognizer
was expecting a "Multiply" symbol.
What we want to know is, why is the recognizer
not expecting an
"Add" symbol? Looking back at the grammar, we see that only one rule
uses the "Add" symbol: the rule ""Term -> Term Add
Term"". The next step is to look at the Earley items for this rule.
But there is a problem. We don't find any.
Next, we ask ourselves, what is the earliest place the ""Term ->
Term Add Term"" rule should be appearing? The answer is that there
should be a prediction of ""Term -> Term Add Term"" at
location 0. So we look at the predictions at location 0.
P0 @0-0 Expression -> . Factor
P2 @0-0 Factor -> . Number
P4 @0-0 Factor -> . Factor Multiply Factor
No ""Term -> Term Add Term"" rule. We are never even
predicting a ""Term -> Term Add Term"" rule. We look
back at the grammar, and start from the beginning.
{ lhs => 'Expression', rhs => [qw/Factor/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term',
rhs => [qw/Term Add Term/],
action => 'do_add'
},
{ lhs => 'Factor',
rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
The start symbol is "Expression" and we do see a rule with
"Expression" on the left hand side. "Expression" in turn
produces a "Factor" symbol, and there are two rules with
"Factor" on the left hand side.
But none of these rules ever produce a "Term". In fact, however far we
follow the productions, no rule ever produces a "Term". At this
point we see the problem: If we start at the start symbol, and follow the
rules of our grammar, we will never get to a "Term" symbol. Which is
exactly what that first warning message was saying.
Now that we know what is wrong, we can reread our grammar, and see that our
"Expression -> Factor" rule is wrong. It should be
"Expression -> Term". Change that and the problem is fixed.
Empty rules¶
When a symbol is nulled in your parse, "show_progress" does not show
its expansion into rules. This reduces clutter, and seems to be the intuitive
way to treat nulled rules.
This section deals with the "progress()" recognizer method, which
allows access to the raw progress report information. This method is not
needed for typical debugging and tracing situations. It is intended for
applications which want to leverage Marpa's "situational awareness"
in innovative ways.
progress()¶
my $report0 = $recce->progress(0);
my $latest_report = $recce->progress();
Given the parse location (Earley set ID) as its argument, the
"progress()" recognizer method returns a reference to an array of
"report items". The parse location may be negative. An argument of
-X will be interpreted as location
N+X+1, where
N is the
latest Earley set. This means that an argument of -1 indicates the latest
Earley set, an argument of -2 indicates the Earley set just before the latest
one, etc.
Each report item is a triple: an array of three elements. The three elements
are, in order, rule ID, dot position, and origin. The data returned by the two
displays above, as well as the data for the other parse locations in our
example, are shown below.
The rule ID is the same number that Marpa uses to identify rules in tracing and
debugging output. Given a rule ID, an application can find its LHS and RHS
symbols using the grammar's "rule()" method.
Dot position is -1 for completions, and 0 for predictions. Where the report item
is not for a completion or a prediction, dot position is
N, where
N is the number of RHS symbols successfully recognized at the parse
location of the progress report.
Origin is parse location (Earley set ID) at which the rule application reported
by the report item began. For a prediction, origin will always be the same as
the location of the parse report.
Progress reports and efficiency¶
When progress reports are used for production parsing, instead of just for
debugging and tracing, efficiency considerations become significant. We will
start with the good news. "progress()" is implemented in C, so that
the application can usually expect calls to "progress()" to be
extremely fast.
Now, as to the potential bad news: Applications planning frequent use of calls
to "progress()" need to be aware that, where there is a right
recursion at a parse location, "progress()" needs to follow the
entire chain of recursions to create the progress report. That this is
happening will always be evident from the parse report itself, which will
contain one report item for every completion in the right-recursive chain of
completions. If an application tries to produce progress reports for a large
number of parse locations, and these parse locations have long right recursive
chains, it can prove computationally expensive.
The efficiency consideration just mentioned for following right recursions is
not an issue for left recursions. Left recursions only produce at most two
report items per parse location and are extremely fast to process. It is also
not an issue for Marpa's sequence rules, because sequence rules are
implemented internally as left recursions.
Appendix: full code and output for the example¶
Below are the code, the trace outputs and the progress report for the example
used in this document.
Code¶
my $grammar = Marpa::R2::Grammar->new(
{ start => 'Expression',
actions => 'My_Actions',
default_action => 'first_arg',
rules => [
## This is a deliberate error in the grammar
## The next line should be:
## { lhs => 'Expression', rhs => [qw/Term/] },
## I have changed the Term to 'Factor' which
## will cause problems.
{ lhs => 'Expression', rhs => [qw/Factor/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term',
rhs => [qw/Term Add Term/],
action => 'do_add'
},
{ lhs => 'Factor',
rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
],
}
);
$grammar->precompute();
my @tokens = (
[ 'Number', 42 ],
[ 'Multiply', q{*} ],
[ 'Number', 1 ],
[ 'Add', q{+} ],
[ 'Number', 7 ],
);
sub My_Actions::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
sub My_Actions::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
sub My_Actions::first_arg { shift; return shift; }
my $recce = Marpa::R2::Recognizer->new(
{ grammar => $grammar, trace_terminals => 2 } );
my $token_ix = 0;
TOKEN: for my $token_and_value (@tokens) {
last TOKEN if not defined $recce->read( @{$token_and_value} );
}
$progress_report = $recce->show_progress( 0, -1 );
Trace output¶
Inaccessible symbol: Add
Inaccessible symbol: Term
Setting trace_terminals option
Expecting "Number" at earleme 0
Accepted "Number" at 0-1
Expecting "Multiply" at 1
Accepted "Multiply" at 1-2
Expecting "Number" at 2
Accepted "Number" at 2-3
Expecting "Multiply" at 3
Rejected "Add" at 3-4
Note the use of the term "earleme". If you are using the default input
model, you can assume that earleme means "location": the earleme and
the location will always be exactly the same. Advanced users, using
alternative input models, may set it up so that earleme and location are two
different things, and in that case the distinction will matter.
show_progress() output¶
P0 @0-0 Expression -> . Factor
P2 @0-0 Factor -> . Number
P4 @0-0 Factor -> . Factor Multiply Factor
F0 @0-1 Expression -> Factor .
F2 @0-1 Factor -> Number .
R4:1 @0-1 Factor -> Factor . Multiply Factor
P2 @2-2 Factor -> . Number
P4 @2-2 Factor -> . Factor Multiply Factor
R4:2 @0-2 Factor -> Factor Multiply . Factor
F0 @0-3 Expression -> Factor .
F2 @2-3 Factor -> Number .
R4:1 x2 @0,2-3 Factor -> Factor . Multiply Factor
F4 @0-3 Factor -> Factor Multiply Factor .
progress() outputs¶
These section contains the output of the "progress()" method -- the
progress reports in their "raw" format. The output is shown in
Data::Dumper format, with "Data::Dumper::Indent" set to 0 and
"Data::Dumper::Terse" set to 1.
The "Data::Dumper" output from "progress()" at location 0:
[[0,0,0],[2,0,0],[4,0,0]]
The "Data::Dumper" output from "progress()" at location 1:
[[0,-1,0],[2,-1,0],[4,1,0]]
The "Data::Dumper" output from "progress()" at location 2:
[[2,0,2],[4,0,2],[4,2,0]]
The default "progress()" output is for the latest Earley set, which is
location 3 in our example. Here is the "progress()" output for
location 3.
[[0,-1,0],[2,-1,2],[4,-1,0],[4,1,0],[4,1,2]]
Copyright and License¶
Copyright 2014 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.