NAME¶
pt::peg::container - PEG Storage
SYNOPSIS¶
package require
Tcl 8.5
package require
snit
package require
pt::peg::container ?1?
::pt::peg objectName
?
=|
:=|
<--|
as|
deserialize src?
objectName destroy
objectName clear
objectName importer
objectName importer object
objectName exporter
objectName exporter object
objectName = source
objectName --> destination
objectName serialize ?
format?
objectName deserialize = data ?
format?
objectName deserialize += data ?
format?
objectName start
objectName start pe
objectName nonterminals
objectName modes
objectName modes dict
objectName rules
objectName rules dict
objectName add ?
nt...?
objectName remove ?
nt...?
objectName exists nt
objectName rename ntold ntnew
objectName mode nt
objectName mode nt mode
objectName rule nt
objectName rule nt pe
DESCRIPTION¶
Are you lost ? Do you have trouble understanding this document ? In that case
please read the overview provided by the
Introduction to Parser Tools.
This document is the entrypoint to the whole system the current package is a
part of.
This package provides a container class for parsing expression grammars, with
each instance storing a single grammar and allowing the user to manipulate and
query its definition.
It resides in the Storage section of the Core Layer of Parser Tools, and is one
of the three pillars the management of parsing expression grammars resides on.
IMAGE: arch_core_container
The other two pillars are, as shown above
- [1]
- PEG Import, and
- [2]
- PEG Export
Packages related to this are:
- pt::rde
- This package provides an implementation of PARAM, a virtual machine for
the parsing of a channel, geared towards the needs of handling PEGs.
- pt::peg::interp
- This package implements an interpreter for PEGs on top of the virtual
machine provided by pt::peg::rde
CLASS API¶
The package exports the API described here.
- ::pt::peg objectName
?=|:=|<--|as| deserialize
src?
- The command creates a new container object for a parsing expression
grammar and returns the fully qualified name of the object command as its
result. The API of this object command is described in the section
Object API. It may be used to invoke various operations on the
object.
The new container will be empty if no src is specified. Otherwise it
will contain a copy of the grammar contained in the src. All
operators except deserialize interpret src as a container
object command. The deserialize operator interprets src as
the serialization of a parsing expression grammar instead, as specified in
section PEG serialization format.
An empty grammar has no nonterminal symbols, and the start expression is the
empty expression, i.e. epsilon. It is valid, but not
useful.
OBJECT API¶
All objects created by this package provide the following methods for the
manipulation and querying of their contents:
- objectName destroy
- This method destroys the object, releasing all claimed memory, and
deleting the associated object command.
- objectName clear
- This method resets the object to contain the empty grammar. It does
not destroy the object itself.
- objectName importer
- This method returns the import manager object currently attached to the
container, if any.
- objectName importer object
- This method attaches the object as import manager to the container,
and returns it as the result of the command. Note that the object
is not put into ownership of the container. I.e., destruction of
the container will not destroy the object.
It is expected that object provides a method named import text
which takes a text and a format name, and returns the canonical
serialization of the table of contents contained in the text, assuming the
given format.
- objectName exporter
- This method returns the export manager object currently attached to the
container, if any.
- objectName exporter object
- This method attaches the object as export manager to the container,
and returns it as the result of the command. Note that the object
is not put into ownership of the container. I.e., destruction of
the container will not destroy the object.
It is expected that object provides a method named export
object which takes the container and a format name, and returns a text
encoding table of contents stored in the container, in the given format.
It is further expected that the object will use the container's
method serialize to obtain the serialization of the table of
contents from which to generate the text.
- objectName = source
- This method assigns the contents of the PEG object source to
ourselves, overwriting the existing definition. This is the assignment
operator for grammars.
This operation is in effect equivalent to
objectName deserialize = [source serialize]
- objectName --> destination
- This method assigns our contents to the PEG object destination,
overwriting the existing definition. This is the reverse assignment
operator for grammars.
This operation is in effect equivalent to
destination deserialize = [objectName serialize]
- objectName serialize ?format?
- This method returns our grammar in some textual form usable for transfer,
persistent storage, etc. If no format is not specified the returned
result is the canonical serialization of the grammar, as specified in the
section PEG serialization format.
Otherwise the object will use the attached export manager to convert the
data to the specified format. In that case the method will fail with an
error if the container has no export manager attached to it.
- objectName deserialize = data ?format?
- This is the complementary method to serialize. It replaces the
current definition with the grammar contained in the data. If no
format was specified it is assumed to be the regular serialization
of a grammar, as specified in the section PEG serialization format
Otherwise the object will use the attached import manager to convert the
data from the specified format to a serialization it can handle. In that
case the method will fail with an error if the container has no import
manager attached to it.
The result of the method is the empty string.
- objectName deserialize += data ?format?
- This method behaves like deserialize = in its essentials, except
that it merges the grammar in the data to its contents instead of
replacing it. The method will fail with an error and leave the grammar
unchanged if merging is not possible, i.e. would produce an invalid
grammar.
The result of the method is the empty string.
- objectName start
- This method returns the current start expression of the grammar.
- objectName start pe
- This method defines the start expression of the grammar. It
replaces the current start expression with the parsing expression
pe, and returns the new start expression.
The method will fail with an error and leave the grammar unchanged if
pe does not contain a valid parsing expression as specified in the
section PE serialization format.
- objectName nonterminals
- This method returns the set of all nonterminal symbols known to the
grammar.
- objectName modes
- This method returns a dictionary mapping the set of all nonterminal
symbols known to the grammar to their semantic modes.
- objectName modes dict
- This method takes a dictionary mapping a set of nonterminal symbols known
to the grammar to their semantic modes, and returns the new full mapping
of nonterminal symbols to semantic modes.
The method will fail with an error if any of the nonterminal symbols in the
dictionary is not known to the grammar, or the empty string, i.e. an
invalid nonterminal symbol, or if any the chosen modes is not one
of the legal values.
- objectName rules
- This method returns a dictionary mapping the set of all nonterminal
symbols known to the grammar to their parsing expressions (right-hand
sides).
- objectName rules dict
- This method takes a dictionary mapping a set of nonterminal symbols known
to the grammar to their parsing expressions (right-hand sides), and
returns the new full mapping of nonterminal symbols to parsing
expressions.
The method will fail with an error any of the nonterminal symbols in the
dictionary is not known to the grammar, or the empty string, i.e. an
invalid nonterminal symbol, or any of the chosen parsing expressions is
not a valid parsing expression as specified in the section PE
serialization format.
- objectName add ?nt...?
- This method adds the nonterminal symbols nt, etc. to the grammar,
and defines default semantic mode and expression for it ( value and
epsilon respectively). The method returns the empty string as its
result.
The method will fail with an error and leaves the grammar unchanged if any
of the nonterminal symbols are either already defined in our grammar, or
are the empty string (an invalid nonterminal symbol).
The method does nothing if no symbol was specified as argument.
- objectName remove ?nt...?
- This method removes the named nonterminal symbols nt, etc. from the
set of nonterminal symbols known to our grammar. The method returns the
empty string as its result.
The method will fail with an error and leave the grammar unchanged if any of
the nonterminal symbols is not known to the grammar, or is the empty
string, i.e. an invalid nonterminal symbol.
- objectName exists nt
- This method tests whether the nonterminal symbol nt is known to our
grammar or not. The result is a boolean value. It will be set to
true if nt is known, and false otherwise.
The method will fail with an error if nt is the empty string, i.e. an
invalid nonterminal symbol.
- objectName rename ntold ntnew
- This method renames the nonterminal symbol ntold to ntnew.
The method returns the empty string as its result.
The method will fail with an error and leave the grammar unchanged if either
ntold is not known to the grammar, or ntnew is already
known, or any of them is the empty string, i.e. an invalid nonterminal
symbol.
- objectName mode nt
- This method returns the current semantic mode for the nonterminal symbol
nt.
The method will fail with an error if nt is not known to the grammar,
or the empty string, i.e. an invalid nonterminal symbol.
- objectName mode nt mode
- This mode sets the semantic mode for the nonterminal symbol nt, and
returns the new mode. The method will fail with an error if nt is
not known to the grammar, or the empty string, i.e. an invalid nonterminal
symbol, or the chosen mode is not one of the legal values.
The following modes are legal:
- value
- The semantic value of the nonterminal symbol is an abstract syntax tree
consisting of a single node node for the nonterminal itself, which has the
ASTs of the symbol's right hand side as its children.
- leaf
- The semantic value of the nonterminal symbol is an abstract syntax tree
consisting of a single node node for the nonterminal, without any
children. Any ASTs generated by the symbol's right hand side are
discarded.
- void
- The nonterminal has no semantic value. Any ASTs generated by the symbol's
right hand side are discarded (as well).
- objectName rule nt
- This method returns the current parsing expression (right-hand side) for
the nonterminal symbol nt.
The method will fail with an error if nt is not known to the grammar,
or the empty string, i.e. an invalid nonterminal symbol.
- objectName rule nt pe
- This method set the parsing expression (right-hand side) of the
nonterminal nt to pe, and returns the new parsing
expression.
The method will fail with an error if nt is not known to the grammar,
or the empty string, i.e. an invalid nonterminal symbol, or pe does
not contain a valid parsing expression as specified in the section PE
serialization format.
Here we specify the format used by the Parser Tools to serialize Parsing
Expression Grammars as immutable values for transport, comparison, etc.
We distinguish between
regular and
canonical serializations. While
a PEG may have more than one regular serialization only exactly one of them
will be
canonical.
- regular serialization
- [1]
- The serialization of any PEG is a nested Tcl dictionary.
- [2]
- This dictionary holds a single key, pt::grammar::peg, and its
value. This value holds the contents of the grammar.
- [3]
- The contents of the grammar are a Tcl dictionary holding the set of
nonterminal symbols and the starting expression. The relevant keys and
their values are
- rules
- The value is a Tcl dictionary whose keys are the names of the nonterminal
symbols known to the grammar.
- [1]
- Each nonterminal symbol may occur only once.
- [2]
- The empty string is not a legal nonterminal symbol.
- [3]
- The value for each symbol is a Tcl dictionary itself. The relevant keys
and their values in this dictionary are
- is
- The value is the serialization of the parsing expression describing the
symbols sentennial structure, as specified in the section PE
serialization format.
- mode
- The value can be one of three values specifying how a parser should handle
the semantic value produced by the symbol.
- value
- The semantic value of the nonterminal symbol is an abstract syntax tree
consisting of a single node node for the nonterminal itself, which has the
ASTs of the symbol's right hand side as its children.
- leaf
- The semantic value of the nonterminal symbol is an abstract syntax tree
consisting of a single node node for the nonterminal, without any
children. Any ASTs generated by the symbol's right hand side are
discarded.
- void
- The nonterminal has no semantic value. Any ASTs generated by the symbol's
right hand side are discarded (as well).
- start
- The value is the serialization of the start parsing expression of the
grammar, as specified in the section PE serialization format.
- [4]
- The terminal symbols of the grammar are specified implicitly as the set of
all terminal symbols used in the start expression and on the RHS of the
grammar rules.
- canonical serialization
- The canonical serialization of a grammar has the format as specified in
the previous item, and then additionally satisfies the constraints below,
which make it unique among all the possible serializations of this
grammar.
- [1]
- The keys found in all the nested Tcl dictionaries are sorted in ascending
dictionary order, as generated by Tcl's builtin command lsort
-increasing -dict.
- [2]
- The string representation of the value is the canonical representation of
a Tcl dictionary. I.e. it does not contain superfluous whitespace.
EXAMPLE¶
Assuming the following PEG for simple mathematical expressions
PEG calculator (Expression)
Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' ;
Sign <- '-' / '+' ;
Number <- Sign? Digit+ ;
Expression <- Term (AddOp Term)* ;
MulOp <- '*' / '/' ;
Term <- Factor (MulOp Factor)* ;
AddOp <- '+'/'-' ;
Factor <- '(' Expression ')' / Number ;
END;
then its canonical serialization (except for whitespace) is
pt::grammar::peg {
rules {
AddOp {is {/ {t -} {t +}} mode value}
Digit {is {/ {t 0} {t 1} {t 2} {t 3} {t 4} {t 5} {t 6} {t 7} {t 8} {t 9}} mode value}
Expression {is {x {n Term} {* {x {n AddOp} {n Term}}}} mode value}
Factor {is {/ {x {t (} {n Expression} {t )}} {n Number}} mode value}
MulOp {is {/ {t *} {t /}} mode value}
Number {is {x {? {n Sign}} {+ {n Digit}}} mode value}
Sign {is {/ {t -} {t +}} mode value}
Term {is {x {n Factor} {* {x {n MulOp} {n Factor}}}} mode value}
}
start {n Expression}
}
Here we specify the format used by the Parser Tools to serialize Parsing
Expressions as immutable values for transport, comparison, etc.
We distinguish between
regular and
canonical serializations. While
a parsing expression may have more than one regular serialization only exactly
one of them will be
canonical.
- Regular serialization
- Atomic Parsing Expressions
- [1]
- The string epsilon is an atomic parsing expression. It matches the
empty string.
- [2]
- The string dot is an atomic parsing expression. It matches any
character.
- [3]
- The string alnum is an atomic parsing expression. It matches any
Unicode alphabet or digit character. This is a custom extension of PEs
based on Tcl's builtin command string is.
- [4]
- The string alpha is an atomic parsing expression. It matches any
Unicode alphabet character. This is a custom extension of PEs based on
Tcl's builtin command string is.
- [5]
- The string ascii is an atomic parsing expression. It matches any
Unicode character below U0080. This is a custom extension of PEs based on
Tcl's builtin command string is.
- [6]
- The string control is an atomic parsing expression. It matches any
Unicode control character. This is a custom extension of PEs based on
Tcl's builtin command string is.
- [7]
- The string digit is an atomic parsing expression. It matches any
Unicode digit character. Note that this includes characters outside of the
[0..9] range. This is a custom extension of PEs based on Tcl's builtin
command string is.
- [8]
- The string graph is an atomic parsing expression. It matches any
Unicode printing character, except for space. This is a custom extension
of PEs based on Tcl's builtin command string is.
- [9]
- The string lower is an atomic parsing expression. It matches any
Unicode lower-case alphabet character. This is a custom extension of PEs
based on Tcl's builtin command string is.
- [10]
- The string print is an atomic parsing expression. It matches any
Unicode printing character, including space. This is a custom extension of
PEs based on Tcl's builtin command string is.
- [11]
- The string punct is an atomic parsing expression. It matches any
Unicode punctuation character. This is a custom extension of PEs based on
Tcl's builtin command string is.
- [12]
- The string space is an atomic parsing expression. It matches any
Unicode space character. This is a custom extension of PEs based on Tcl's
builtin command string is.
- [13]
- The string upper is an atomic parsing expression. It matches any
Unicode upper-case alphabet character. This is a custom extension of PEs
based on Tcl's builtin command string is.
- [14]
- The string wordchar is an atomic parsing expression. It matches any
Unicode word character. This is any alphanumeric character (see alnum),
and any connector punctuation characters (e.g. underscore). This is a
custom extension of PEs based on Tcl's builtin command string
is.
- [15]
- The string xdigit is an atomic parsing expression. It matches any
hexadecimal digit character. This is a custom extension of PEs based on
Tcl's builtin command string is.
- [16]
- The string ddigit is an atomic parsing expression. It matches any
decimal digit character. This is a custom extension of PEs based on Tcl's
builtin command regexp.
- [17]
- The expression [list t x] is an atomic parsing expression. It
matches the terminal string x.
- [18]
- The expression [list n A] is an atomic parsing expression. It
matches the nonterminal A.
- Combined Parsing Expressions
- [1]
- For parsing expressions e1, e2, ... the result of [list /
e1 e2 ... ] is a parsing expression as well. This is the
ordered choice, aka prioritized choice.
- [2]
- For parsing expressions e1, e2, ... the result of [list x
e1 e2 ... ] is a parsing expression as well. This is the
sequence.
- [3]
- For a parsing expression e the result of [list * e] is a
parsing expression as well. This is the kleene closure, describing
zero or more repetitions.
- [4]
- For a parsing expression e the result of [list + e] is a
parsing expression as well. This is the positive kleene closure,
describing one or more repetitions.
- [5]
- For a parsing expression e the result of [list & e] is a
parsing expression as well. This is the and lookahead
predicate.
- [6]
- For a parsing expression e the result of [list ! e] is a
parsing expression as well. This is the not lookahead
predicate.
- [7]
- For a parsing expression e the result of [list ? e] is a
parsing expression as well. This is the optional input.
- Canonical serialization
- The canonical serialization of a parsing expression has the format as
specified in the previous item, and then additionally satisfies the
constraints below, which make it unique among all the possible
serializations of this parsing expression.
- [1]
- The string representation of the value is the canonical representation of
a pure Tcl list. I.e. it does not contain superfluous whitespace.
- [2]
- Terminals are not encoded as ranges (where start and end of the
range are identical).
EXAMPLE¶
Assuming the parsing expression shown on the right-hand side of the rule
Expression <- Term (AddOp Term)*
then its canonical serialization (except for whitespace) is
{x {n Term} {* {x {n AddOp} {n Term}}}}
BUGS, IDEAS, FEEDBACK¶
This document, and the package it describes, will undoubtedly contain bugs and
other problems. Please report such in the category
pt of the
Tcllib
Trackers [
http://core.tcl.tk/tcllib/reportlist]. Please also report any
ideas for enhancements you may have for either package and/or documentation.
KEYWORDS¶
EBNF, LL(k), PEG, TDPL, context-free languages, expression, grammar, matching,
parser, parsing expression, parsing expression grammar, push down automaton,
recursive descent, state, top-down parsing languages, transducer
CATEGORY¶
Parsing and Grammars
COPYRIGHT¶
Copyright (c) 2009 Andreas Kupries <andreas_kupries@users.sourceforge.net>