NAME¶
Marpa::R2::Advanced::Models - Other input models
About this document¶
The alternative input models described in this document are an advanced
technique. If you are starting out with Marpa, you probably want to ignore
this document. If you are an experienced Marpa user, it is still safe to
ignore this document, but you might find the possibilities it discusses
interesting.
Marpa has two different ideas of location¶
In the other Marpa documentation, we have spoken of "location", and
assumed the standard input model. The locations actually in use by the methods
described for the standard input model were Earley set ordinals. (An Earley
set's ordinal is also its ID.)
Marpa actually has two different ideas of location -- Earley set ordinal and
earleme. This is ignored in the other Marpa documents, and it can be, because
they assume the standard input model. Use of the standard input model
guarantees that earleme and Earley set ordinal will always be exactly the
same.
This document introduces methods which make it possible (and in fact likely)
that earleme and Earley set ordinal will differ. From here on, the reader will
need to pay careful attention to the distinction.
An alternative input model is anything that is not the default, token-stream
model. More helpfully, Marpa allows variable-length tokens and ambiguous
tokens, and an alternative input model is any input model which
- •
- Allows a token whose length is not exactly 1, or
- •
- Allows locations which have more than one token.
To do either of these things, a user must use the recognizer's
"alternative" method. In other words, if an application is not
directly using the recognizer's "alternative" method call, that
application is not using an alternative input method.
Many concepts, such as parsing location, parse exhaustion, and the end of
parsing, are somewhat more complicated when alternative input models are
involved. These concepts were explained in the main document for the
recognizer on the assumption that the default input model was in use. This
document revises those explanations as necessary to take into account the
alternative input models.
Token streams¶
Marpa's default input model is the traditional one -- a token stream. Token
streams are very standard in parsing applications -- so much so that most
texts do not take the trouble of defining the term. A
token stream is
input structured as a sequence of tokens, where each token occupies one
location and every location has a token. In the token stream model, all tokens
are of the same length.
Conventionally, all tokens are of length 1, and the token stream starts at
location 0. Following this convention, the
Nth token would start at
location
N-1 and end at location
N. For example, the first token
would start at location 0 and end at location 1. The second token would start
at location 1 and end at location 2.
Earlemes¶
For most parsers, position is location in a token stream. To deal with
variable-length and overlapping tokens, Marpa needs a more flexible idea of
location.
Marpa's tracks position in terms of
earlemes.
Earlemes are named
after Jay Earley, the inventor of the first algorithm in Marpa's lineage.
Every token has a start earleme and an end earleme.
The token stream model may also be called the token-per-earleme model. In the
token stream model, token location and earleme location are exactly identical.
More formally, in the token stream model, if the token location is
N,
then the earleme location is also
N. If a user's application uses the
token stream model, the user can ignore the existence of earlemes, and can
think entirely in terms of tokens and their position in a token stream.
Because of this, the main Marpa documents often speak simply of the
"location" in the parse.
The furthest earleme¶
The
furthest earleme is the last earleme at which a token ends. In the
default input model, the furthest earleme and the current earleme are always
the same. As a result, in the default input model, the furthest earleme is not
an important concept.
In alternative input models, tokens may be longer than 1 earleme, and the
furthest earleme and the current earleme may be far apart. This becomes an
issue when parsing is finished. Alternative input models use the recognizer's
"end_input" method to ensure that processing of input catches up to
the furthest earleme.
The latest Earley set and latest earleme¶
The
latest earleme is the earleme location of the latest Earley set. In
the default input model, the latest earleme is always the same as the current
earleme.
In alternative input models, there may not be an Earley set at a given earleme
location. When that is the case for the current earleme, then the latest
Earley set is not at the current earleme, and the latest earleme and current
earlemes are different.
Methods¶
alternative()¶
$recce->alternative( 'a', \42, 1 ) or return 'First alternative failed';
The "alternative" method is the most general method for reading input,
and is used in alternative input models. It takes three arguments, only the
first of which is required.
The first two arguments are the token type and a reference to the token value.
If the reference to the token value is omitted, or is "undef", the
value of the token will be a Perl "undef".
The third argument to the "alternative" method is a token length. If
omitted, token length defaults to 1, which is the correct value for the token
stream model. Its value can be any integer
greater than zero. Marpa
does not allow zero length tokens in any input model.
Unlike the "read" method, the "alternative" method does
not advance the current location (current earleme) on each call. This
allows the application to read several tokens at the same earleme. This is how
ambiguous input is created. To advance the current earleme when input is read
using the "alternative" method, the "complete_earleme"
method must be called.
On success, "alternative" returns a Perl true value. If the token is
rejected, "alternative" returns a Perl false. On other failures,
"alternative" throws an exception.
current_earleme()¶
$current_earleme = $recce->current_earleme();
Returns the current parse location, also known as the current earleme. Not often
needed.
earleme()¶
my $origin_earleme = $recce->earleme($origin_earley_set_id);
Given an Earley set ID as its argument, the "earleme()" recognizer
method returns the corresponding earleme. Every integer in the range from 0 to
the ID of the latest Earley is a valid Earley set ID, and every valid Earley
set ID corresponds to an earleme. If the argument of "earleme()" is
greater than the latest Earley set ID, "earleme()" returns Perl
"undef".
There is currently no method that translates from earleme to Earley set. Earley
set to earleme translation is a well-behaved one-to-one function in all input
models -- for every Earley set there is a earleme, and every earleme is mapped
to by at most one Earley set. Earleme to Earley set translation is far less
well-behaved. In many input models, it is a partial function -- there are some
earlemes that are in the valid range of earlemes but do not map to any Earley
set.
Earleme to Earley set translation is often not needed. When it is, it can be
implemented at the application level, with the application taking advantage of
what it knows about its choice of input model.
earleme_complete()¶
$recce->earleme_complete();
Processes all tokens at the current earleme and advances the current earleme by
1. If the earleme cannot be completed, an exception is thrown. Otherwise, an
"event" count is returned. If zero is returned, the earleme was
completed without event. In this context, an "event" is one of a
list of occurrences of special interest during the successful completion of an
earleme, as described for the recognizer's "read" method.
All tokens read using the "alternative" method start at one location
-- the current earleme. When reading input using the "alternative"
method, "earleme_complete" is used to complete processing at the
current earleme and move forward in the input stream.
"earleme_complete" may be called even if the "alternative"
method has been not called since the last call to
"earleme_complete". This will create an earleme with no tokens. In
certain input models, such as the character-per-earleme model, this can be
useful.
$recce->end_input();
Indicates that input is finished. Calling "end_input" is not necessary
or useful in the default input model, because in the default input model no
token has a length greater than 1.
The "end_input" method takes no arguments. The "end_input"
method returns a Perl true value on success. On failure, it throws an
exception.
In alternative input models, calling the "earleme_complete" method
once input is finished does not ensure that all input has been processed. The
"earleme_complete" method completes the current earleme, but in
alternative models, tokens may extend well past the current earleme. The
"end_input" method ensures that all input is processed.
Calling "end_input" multiple times on the same recognizer object is
harmless, but useless. The second and subsequent calls will return a Perl
true, but will have no effect.
Alternative models and "read()"¶
A recognizer can mix calls to its "read" method with calls to its
"alternative" method. The "read" method has the same
effect as a single call to the "alternative" method, followed
immediately by a call of the "earleme_complete" method.
Ambiguous lexing¶
Marpa allows ambiguous tokens. Several Marpa tokens can start at a single
parsing location. Ambiguous tokens can be of various lengths. Tokens can also
overlap.
Potentially ambiguous lexing occurs when more than one token
starts at a single earleme. When potentially ambiguous lexing occurs, it
becomes possible for there to be more than one sequence of tokens.
An
actual lexical ambiguity only occurs if more than one of the potential
token sequences is consistent with the grammar. If there is no actual lexical
ambiguity, Marpa will use the only token choice that is consistent with the
grammar.
When lexing is
actually ambiguous, Marpa will use all the alternatives
consistent with the grammar. When the lexing in a parse is actually ambiguous,
the parse will be ambiguous. From the point of view of Marpa's semantics,
ambiguities caused by lexing look the same as ambiguities caused by an
ambiguous grammar.
In the standard terminology, if a grammar produces more than one parse tree for
any input, then that grammar must be ambiguous. In Marpa this is not strictly
true. In Marpa, if the input is ambiguous, even an unambiguous grammar can
produce more than one parse.
Duplicate tokens¶
A duplicate token is a token of the same type and the same length as another
that was read at the same earleme. Duplicate tokens are impossible in the
default, token-stream, model. This is because in the token-stream model only
one token can be read at each earleme.
In alternative models, more than one token may be read at an earleme, and
duplicates
are possible. Marpa detects duplicate tokens and treats them
as "hard errors" -- Marpa throws an exception when it sees a
duplicate token. Marpa's assumption is that duplicate tokens indicate an error
at the application level.
An application can retry input after a duplicate token, if it catches the
exception. In the future, if recovery from duplicate tokens is found to be a
useful technique, Marpa may provide an option to change its behavior, so that
a soft failure is returned when there is a duplicate token.
Earlemes: the details¶
While scanning, Marpa keeps track of the
current earleme. Earlemes in a
parse start at earleme 0 and increase numerically. The earleme immediately
following earleme 0 is earleme 1, the earleme immediately following earleme 1
is earleme 2, and so on. The earleme immediately following earleme
N is
always earleme
N+1.
Distance in the earleme stream is what you would intuitively expect it to
be. The distance between earleme
X and earleme
Y is the absolute
value of the difference between
X and
Y,
|X-Y|. The
distance from earleme 3 to earleme 6, for example, is 3 earlemes.
Whenever a token is given to Marpa to be scanned, it starts at the current
earleme. In addition to the type and value of the token, Marpa must be told
the token's
length in earlemes. The length of a Marpa token must be
greater than zero.
This earleme length will become the distance from the start of the token to the
end of the token. If the length of the token is
L, and the current
earleme is
C, the end of the token will be at earleme
C+L.
The character-per-earleme model¶
Many different models of the relationship between tokens and earlemes are
possible, but two are particularly important. One is the one-token-per-earleme
model, which is the default, and which has already been described. The other
is the one-character-per-earleme model.
In the one-character-per-earleme model, every character will be treated as being
exactly one earleme in length. If a token is more than one character in
length, that token will span earlemes. When the lexing is ambiguous, tokens
may overlap.
When a one-character-per-earleme model of input is used, there may be many
earlemes at which no tokens start. For example, in a straightforward
character-per-earleme implementation of a grammar for a language that allows
comments, no tokens will start at any earlemes which correspond to character
locations inside a comment.
So far only the token-per-earleme and character-per-earleme models have seen any
real use in Marpa programs. But other models are certainly possible. Using
earlemes, you can structure your input in almost any way you like.
There are only three restrictions:
- 1.
- Scanning always starts at earleme 0.
- 2.
- All tokens starting at earleme N must be scanned before any tokens
starting at earleme N+1. In other words, the tokens must be scanned
in non-decreasing order by start earleme.
- 3.
- Every token must have a length, in earlemes, which is greater than zero.
In other words, token length can never be zero or negative.
Copyright and License¶
Copyright 2014 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.