.\" Automatically generated by Pod::Man 2.28 (Pod::Simple 3.28) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is turned on, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{ . if \nF \{ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "Marpa::R2::Vocabulary 3pm" .TH Marpa::R2::Vocabulary 3pm "2015-03-06" "perl v5.20.2" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" Marpa::R2::Vocabulary \- Standard parsing terms as used within Marpa .SH "Description" .IX Header "Description" The definitions in this document are of standard parsing terms as they are used in the Marpa context. I put \fBdefining uses\fR of terms in boldface, for easy skimming. Where a narrow or specialized sense of the term is the one that applies within Marpa, that is the only definition given. Marpa also sometimes uses a standard term with a definition which is slightly different from the standard one. (\*(L"Ambiguous grammar\*(R" is one example, and \*(L"grammar\*(R" itself is another.) When this is the case, it is explicitly pointed out. .PP A reader totally new to parsing will find this document too terse to act as a textbook or tutorial. They will need to look elsewhere first. As an introduction, I recommend Mark Jason Dominus's excellent chapter on parsing in the Perl context. It's available on-line. Wikipedia is also an excellent place to start. .SS "Basic" .IX Subsection "Basic" A \fBgrammar\fR is a set of rules, associated with a set of symbols, one of which is distinguished as the start symbol. A \fBsymbol string\fR, or simply \fBstring\fR where the meaning is clear, is an ordered series of symbols. The \fBlength\fR of a string is the number of symbols in it. .PP A \fBlanguage\fR is a set of \fBsymbol strings\fR. A grammar defines a \fBlanguage\fR, as will be described later. .PP It is important to note that the term language, as it is used in parsing theory, means something very different from what it means in ordinary use. The meaning of the strings is an essential part of the ordinary idea of what a language is. In ordinary use, the word \*(L"language\*(R" means far more than a unordered list of its sentences. In parsing terminology, meaning (or \fBsemantics\fR as it is called) is a separate issue. For parsing theory a language is exactly a set of strings \*(-- that and nothing more. .PP The Marpa definition of a grammar differs slightly from the various standard ones. Standard definitions usually sharply distinguish terminal symbols from non-terminals. Marpa does not. .SS "Stages of parsing" .IX Subsection "Stages of parsing" A \fBrecognizer\fR is a program that determines whether its \fBinput\fR is in the language of a grammar and a start symbol. A \fBparser\fR is a program which finds the structure of that input. .PP The term \fBparsing\fR is used in a strict and a loose sense. \&\fBParsing in the loose sense\fR is all phases of finding a grammar's structure, including a separate recognition phase if the parser has one. (Marpa does.) If a parser has phases, \&\fBparsing in the strict sense\fR refers specifically to the phase that finds the structure of the input. When the Marpa documents use the term \fBparsing\fR in its strict sense, they will speak explicitly of \*(L"parsing in the strict sense\*(R". Otherwise, \fBparsing\fR will mean parsing in the loose sense. .PP Parsers often use a \&\fBlexical analyzer\fR to convert \fBraw input\fR, usually \fBinput text\fR, into a \fBtoken stream\fR, which is a series of \fBtokens\fR. Each token represents a \fBsymbol\fR of the grammar and has a \fBvalue\fR. A lexical analyzer is often called a \fBlexer\fR or a \fBscanner\fR, and \fBlexical analysis\fR is often called \fBlexing\fR or \fBscanning\fR. .PP The series of symbols represented by the series of tokens becomes the \fBsymbol string input\fR seen by the recognizer. The \fBsymbol string input\fR is more often called the \fBinput sentence\fR. .PP By default, Marpa uses the token stream model of input. Marpa also allows alternative input models. These are new to Marpa, so that their terminology is of necessity non-standard. The terminology needed for alternative input models is explained in the document that introduces them. .SS "Rules" .IX Subsection "Rules" A standard way of describing rules is Backus-Naur Form, or \fB\s-1BNF\s0\fR. A rule of a grammar is also often called a \fBproduction\fR. In one common way of writing \s-1BNF,\s0 a production looks like this: .PP .Vb 1 \& Expression ::= Term Factor .Ve .PP In the production above, \f(CW\*(C`Expression\*(C'\fR, \f(CW\*(C`Term\*(C'\fR and \f(CW\*(C`Factor\*(C'\fR are symbols. A production consists of a \fBleft hand side\fR and a \fBright hand side\fR. In a \fBcontext-free grammar\fR, like those Marpa parses, the left hand side of a production is always a symbol string of length 1. The right hand side of a production is a symbol string of zero or more symbols. In the example, \f(CW\*(C`Expression\*(C'\fR is the left hand side, and \&\f(CW\*(C`Term\*(C'\fR and \f(CW\*(C`Factor\*(C'\fR are right hand side symbols. .PP Left hand side and right hand side are often abbreviated as \fB\s-1RHS\s0\fR and \fB\s-1LHS\s0\fR. If the \s-1RHS\s0 of a production has no symbols, the production is called an \fBempty production\fR or an \fBempty rule\fR. .PP Any symbol which is allowed to occur in the symbol string input is called a \fBterminal\fR symbol. If the symbols in a symbol string are all terminals, that symbol string is also called a \fBsentence\fR. .SS "Derivations" .IX Subsection "Derivations" A \fBstep\fR of a derivation, or \fBderivation step\fR, is a change made to a symbol string by applying one of the productions from the grammar. The production must be one of those with a \s-1LHS\s0 that occurs in the symbol string. The result of the derivation step is another symbol string, one in which every occurence of the \s-1LHS\s0 symbol from the production is replaced by the \s-1RHS\s0 of the production. For example, if \f(CW\*(C`A\*(C'\fR, \f(CW\*(C`B\*(C'\fR, \f(CW\*(C`C\*(C'\fR, \f(CW\*(C`D\*(C'\fR, and \f(CW\*(C`X\*(C'\fR are symbols, and .PP .Vb 1 \& X ::= B C .Ve .PP is a production, then .PP .Vb 1 \& A X D \-> A B C D .Ve .PP is a derivation step, with "\f(CW\*(C`A X D\*(C'\fR\*(L" as its beginning and \*(R"\f(CW\*(C`A B C D\*(C'\fR\*(L" as its end or result. We say that the symbol string \*(R"\f(CW\*(C`A X D\*(C'\fR" \&\fBderives\fR the symbol string "\f(CW\*(C`A B C D\*(C'\fR". .PP A \fBderivation\fR is a sequence of derivation steps. The \fBlength\fR of a derivation is its length in steps. .IP "\(bu" 4 We say that a first symbol string \fBdirectly derives\fR a second symbol string if and only if there is a derivation of length 1 from the first symbol string to the second symbol string. .IP "\(bu" 4 Every symbol string is said to derive itself in a derivation of length 0. A zero length derivation is a \fBtrivial derivation\fR. .IP "\(bu" 4 A derivation which is not trivial (that is, a derivation which has one or more steps) is a \fBnon-trivial\fR derivation. .IP "\(bu" 4 A non-trivial derivation of a symbol string from itself is called a \fBcycle\fR. Grammars which contain cycles are traditionally considered useless, but Marpa will parse with such grammars. .IP "\(bu" 4 If a derivation is not trivial or direct, that is, if it has more than one step, then it is an \fBindirect\fR derivation. .PP Technically, a symbol \f(CW\*(C`X\*(C'\fR and a string that consists of only that symbol are two different things. But we often say "the symbol \f(CW\*(C`X\*(C'\fR\*(L" as shorthand for \*(R"the string of length 1 whose only symbol is \f(CW\*(C`X\*(C'\fR". For example, if the string containing only the symbol \f(CW\*(C`X\*(C'\fR derives a string \f(CW\*(C`Y\*(C'\fR, we will usually say simply that "\f(CW\*(C`X\*(C'\fR derives \f(CW\*(C`Y\*(C'\fR". .PP Wherever symbol or string \f(CW\*(C`X\*(C'\fR derives \f(CW\*(C`Y\*(C'\fR, we may also say \f(CW\*(C`X\*(C'\fR \fBproduces\fR \f(CW\*(C`Y\*(C'\fR. Derivations are often described as symbol matches. Wherever symbol or string \f(CW\*(C`X\*(C'\fR derives \f(CW\*(C`Y\*(C'\fR, we may also say that \f(CW\*(C`Y\*(C'\fR \fBmatches\fR \f(CW\*(C`X\*(C'\fR or that \f(CW\*(C`X\*(C'\fR \fBmatches\fR \f(CW\*(C`Y\*(C'\fR. It is particularly common to say that \&\f(CW\*(C`X\*(C'\fR matches \f(CW\*(C`Y\*(C'\fR when \f(CW\*(C`X\*(C'\fR or \f(CW\*(C`Y\*(C'\fR is a sentence. .PP The parse of an input by a grammar is \fBsuccessful\fR if and only if, according to the grammar, the start symbol produces the input sentence. The set of all input sentences that a grammar will successfully parse is the \fBlanguage\fR of the grammar. .SS "Nulling" .IX Subsection "Nulling" The zero length symbol string is called the \fBempty string\fR. The empty string can be considered to be a sentence, in which case it is the \fBempty sentence\fR. A string of one or more symbols is \fBnon-empty\fR. A derivation which produces the empty string is a \fBnull derivation\fR. A derivation from the start symbol which produces the empty string is a \fBnull parse\fR. .PP If in a particular grammar, a symbol has a null derivation, it is a \fBnullable symbol\fR. If, in a particular grammar, the only sentence produced by a symbol is the empty sentence, it is a \fBnulling symbol\fR. All nulling symbols are nullable symbols. .PP If a symbol is not nullable, it is \fBnon-nullable\fR. If a symbol is not nulling, it is \fBnon-nulling\fR. In any instance where a symbol produces the empty string, it is said to be \fBnulled\fR, or to be a \fBnull symbol\fR. .SS "Useless rules" .IX Subsection "Useless rules" If any derivation from the start symbol uses a rule, that rule is called \fBreachable\fR or \fBaccessible\fR. A rule that is not accessible is called \fBunreachable\fR or \fBinaccessible\fR. If any derivation which results in a sentence uses a rule, that rule is said to be \fBproductive\fR. A rule that is not productive is called \fBunproductive\fR. For example, a rule is unproductive unless every symbol on its \s-1RHS\s0 either is a terminal or is the \s-1LHS\s0 of some other rule. A rule which is inaccessible or unproductive is called a \&\fBuseless\fR rule. Marpa can handle grammars with useless rules. .PP A symbol is \fBreachable\fR or \fBaccessible\fR if it appears in a reachable production. If a symbol is not reachable, it is \fBunreachable\fR or \fBinaccessible\fR. A symbol is \fBproductive\fR if it appears on the \s-1LHS\s0 of a productive rule, or if it is a nullable symbol. If a symbol is not productive, it is \fBunproductive\fR. A symbol which is inaccessible or unproductive is called a \&\fBuseless\fR symbol. Marpa can handle grammars with useless symbols. .SS "Recursion and cycles" .IX Subsection "Recursion and cycles" If any symbol in the grammar non-trivially produces a symbol string containing itself, the grammar is said to be \fBrecursive\fR. If any symbol non-trivially produces a symbol string with itself on the left, the grammar is said to be \fBleft-recursive\fR. If any symbol non-trivially produces a symbol string with itself on the right, the grammar is said to be \fBright-recursive\fR. Marpa can handle all recursive grammars, including grammars which are left-recursive, grammars which are right-recursive, and grammars which contain both left\- and right-recursion. .PP A \fBcycle\fR is a non-trivial derivation of a string of symbols from itself. If it is not possible for any derivation using a grammar to contain a cycle, then that grammar is said to be \fBcycle-free\fR. Traditionally, a grammar is considered useless if it is not cycle-free. .PP The traditional deprecation of cycles is well-founded. A cycle is the parsing equivalent of an infinite loop. Once a cycle appears, it can be repeated over and over again. Even a very short input sentence can have an infinite number of parses when the grammar is not cycle-free. .PP For that reason, a grammar which contains a cycle is also called \fBinfinitely ambiguous\fR. Marpa can parse with grammars which are not cycle-free, and will even parse inputs that cause cycles. When a parse is infinitely ambiguous, Marpa limits cycles to a single loop, so that only a finite number of parses is returned. .SS "Parse structure" .IX Subsection "Parse structure" The structure of a parse can be represented as a series of derivation steps from the start symbol to the input. Another way to represent structure is as a \fBparse tree\fR. Every symbol used in the parse is represented by a \fBnode\fR of the parse tree. Wherever a production is used in the parse, its \s-1LHS\s0 is represented by a \fBparent node\fR and the \s-1RHS\s0 symbols are represented by \fBchild nodes\fR. The start symbol is the \fBroot\fR of the tree. The node at the root of the tree is called the \fBstart node\fR. .PP Traditionally, grammars divide all symbols sharply into terminals and non-terminals. A terminal symbol must \s-1ALWAYS\s0 be used as a terminal. A non-terminal symbol can \s-1NEVER\s0 be used as a terminal. .PP Marpa's use of terminals is non-traditional, and its terminology is different accordingly. As in the traditional approach, Marpa's non-terminals can never be used as terminals. But Marpa terminals can be used anywhere, even in places where the traditional approach requires a a non-terminal symbol. In particular, a Marpa terminal can be the \s-1LHS\s0 of a rule. .PP Traditionally, and in Marpa as well, every node is either a \&\fBinner node\fR or a \&\fBleaf node\fR. In Marpa, \&\fBleaf nodes\fR are of two kinds: .IP "\(bu" 4 Nodes for nulled symbols. A node for a nulled symbol is called a \fBnulled node\fR. .IP "\(bu" 4 Nodes for symbols being used as terminals. .SS "Ambiguity" .IX Subsection "Ambiguity" Marpa allows ambiguous grammars. Traditionally we say that a parse is \fBambiguous\fR if, for a given grammar and a given input, more than one derivation tree is possible. However, Marpa allows ambiguous input tokens, which the traditional definition does not take into account. If Marpa used the traditional definition, all grammars would be ambiguous except those grammars which allowed only the null parse. .PP It is easiest if the Marpa definition and the traditional definition were extensionally equivalent \-\-\- that is, if Marpa's set of ambiguous grammars was exactly the same as the set of traditionally ambiguous grammars. This can be accomplished by using a slightly altered definition. In the Marpa context, a grammar is \fBambiguous\fR if and only if, for some \s-1UNAMBIGUOUS\s0 stream of input tokens, that grammar produces more than one parse tree. .SS "Semantics" .IX Subsection "Semantics" In real life, the structure of a parse is usually a means to an end. Grammars usually have a \fBsemantics\fR associated with them, and what the user actually wants is the \fBvalue\fR of the parse according to the semantics. .PP The tree representation is especially useful when evaluating a parse. In the traditional method of evaluating a parse tree, every node which represents a terminal symbol has a value associated with it on input. Non-null inner nodes take their semantics from the production whose \s-1LHS\s0 they represent. Nulled nodes are dealt with as special cases. .PP The semantics for a production describe how to calculate the value of the node which represents the \s-1LHS \&\s0(the parent node) from the values of zero or more of the nodes which represent the \s-1RHS\s0 symbols (child nodes). Values are computed recursively, bottom-up. The value of a parse is the value of its start symbol. .SH "Copyright and License" .IX Header "Copyright and License" .Vb 5 \& Copyright 2014 Jeffrey Kegler \& This file is part of Marpa::R2. Marpa::R2 is free software: you can \& redistribute it and/or modify it under the terms of the GNU Lesser \& General Public License as published by the Free Software Foundation, \& either version 3 of the License, or (at your option) any later version. \& \& Marpa::R2 is distributed in the hope that it will be useful, \& but WITHOUT ANY WARRANTY; without even the implied warranty of \& MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU \& Lesser General Public License for more details. \& \& You should have received a copy of the GNU Lesser \& General Public License along with Marpa::R2. If not, see \& http://www.gnu.org/licenses/. .Ve