NAME¶
bt_language - the BibTeX data language, as recognized by btparse
SYNOPSIS¶
# Lexical grammar, mode 1: top-level
AT \@
NEWLINE \n
COMMENT \%~[\n]*\n
WHITESPACE [\ \r\t]+
JUNK ~[\@\n\ \r\t]+
# Lexical grammar, mode 2: in-entry
NEWLINE \n
COMMENT \%~[\n]*\n
WHITESPACE [\ \r\t]+
NUMBER [0-9]+
NAME [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
LBRACE \{
RBRACE \}
LPAREN \(
RPAREN \)
EQUALS =
HASH \#
COMMA ,
QUOTE \"
# Lexical grammar, mode 3: strings
# (very hairy -- see text)
# Syntactic grammar:
bibfile : ( entry )*
entry : AT NAME body
body : STRING # for comment entries
| ENTRY_OPEN contents ENTRY_CLOSE
contents : ( NAME | NUMBER ) COMMA fields # for regular entries
| fields # for macro definition entries
| value # for preamble entries
fields : field { COMMA fields }
|
field : NAME EQUALS value
value : simple_value ( HASH simple_value )*
simple_value : STRING
| NUMBER
| NAME
DESCRIPTION¶
One of the problems with BibTeX is that there is no formal specification of the
language. This means that users exploring the arcane corners of the language
are largely on their own, and programmers implementing their own parsers are
completely on their own---except for observing the behaviour of the original
implementation.
Other parser implementors (Nelson Beebe of "bibclean" fame, in
particular) have taken the trouble to explain the language accepted by their
parser, and in that spirit the following is presented.
If you are unfamiliar with the arcana of regular and context-free languages, you
will not have any easy time understanding this. This is
not an
introduction to the BibTeX language; any LaTeX book would be more suitable for
learning the data language itself.
LEXICAL GRAMMAR¶
The lexical scanner has three distinct modes: top-level, in-entry, and string.
Roughly speaking, top-level is the initial mode; we enter in-entry mode on
seeing an "@" at top-level; and on seeing the "}" or
")" that ends the entry, we return to top-level. We enter string
mode on seeing a """ or non-entry-delimiting "{" from
in-entry mode. Note that the lexical language is both non-regular (because
braces must balance) and context-sensitive (because "{" can mean
different things depending on its syntactic context). That said, we will use
regular expressions to describe the lexical elements, because they are the
starting point used by the lexical scanner itself. The rest of the lexical
grammar will be informally explained in the text.
From top-level, the following tokens are recognized according to the regular
expressions on the right:
AT \@
NEWLINE \n
COMMENT \%~[\n]*\n
WHITESPACE [\ \r\t]+
JUNK ~[\@\n\ \r\t]+
(Note that this is PCCTS regular expression syntax, which should be fairly
familiar to users of other regex engines. One oddity is that a character class
is negated as "~[...]" rather than "[^...]".)
On seeing "at" at top-level, we enter in-entry mode. Whitespace, junk,
newlines, and comments are all skipped, with the latter two incrementing a
line counter. (Junk is explicitly recognized to allow for "bibtex"'s
"implicit comment" scheme.)
From in-entry mode, we recognize newline, comment, and whitespace identically to
top-level mode. In addition, the following tokens are recognized:
NUMBER [0-9]+
NAME [a-z0-9\!\$\&\*\+\-\.\/\:\;\<\>\?\[\]\^\_\`\|]+
LBRACE \{
RBRACE \}
LPAREN \(
RPAREN \)
EQUALS =
HASH \#
COMMA ,
QUOTE \"
At this point, the lexical scanner starts to sound suspiciously like a
context-free grammar, rather than a collection of independent regular
expressions. However, it is necessary to keep this complexity in the scanner
because certain characters ("{" and "(" in particular)
have very different lexical meanings depending on the tokens that have
preceded them in the input stream.
In particular, "{" and "(" are treated as "entry
openers" if they follow one "at" and one "name"
token, unless the value of the "name" token is "comment".
(Note the switch from top-level to in-entry between the two tokens.) In the
@comment case, the delimiter is considered as starting a string, and we enter
string mode. Otherwise, the delimiter is saved, and when we see a
corresponding "}" or ")" it is considered an "entry
closer". (Braces are balanced for free here because the string lexer
takes care of counting brace-depth.)
Anywhere else, "{" is considered as starting a string, and we enter
string mode. """ always starts a string, regardless of context.
The other tokens ("name", "number", "equals",
"hash", and "comma") are recognized unconditionally.
Note that "name" is a catch-all token used for entry types, citation
keys, field names, and macro names; because BibTeX has slightly different
(largely undocumented) rules for these various elements, a bit of trickery is
needed to make things work. As a starting point, consider BibTeX's definition
of what's allowed for an entry key: a sequence of any characters
except
" # % ' ( ) , = { }
plus space. There are a couple of problems with this scheme. First, without
specifying the character set from which those "magic 10" characters
are drawn, it's a bit hard to know just what is allowed. Second, allowing
"@" characters could lead to confusing BibTeX syntax (it doesn't
confuse BibTeX, but it might confuse a human reader). Finally, allowing
certain characters that are special to TeX means that BibTeX can generate
bogus TeX code: try putting a backslash ("\") or tilde
("~") in a citation key. (This last exception is rather specific to
the "generating (La)TeX code from a BibTeX database" application,
but since that's the major application for BibTeX databases, then it will
presumably be the major application for
btparse, at least initially.
Thus, it makes sense to pay attention to this problem.)
In
btparse, then, a name is defined as any sequence of letters, digits,
underscores, and the following characters:
! $ & * + - . / : ; < > ? [ ] ^ _ ` |
This list was derived by removing BibTeX's "magic 10" from the set of
printable 7-bit ASCII characters (32-126), and then further removing
"@", "\", and "~". This means that
btparse disallows some of the weirder entry keys that BibTeX would
accept, such as "\foo@bar", but still allows a string with initial
digits. In fact, from the above definition it appears that
btparse
would accept a string of all digits as a "name;" this is not the
case, though, as the lexical scanner recognizes such a digit string as a
number first. There are two problems here: BibTeX entry keys may in fact be
entirely numeric, and field names may not begin with a digit. (Those are two
of the not-so-obvious differences in BibTeX's handling of keys and field
names.) The tricks used to deal with these problems are implemented in the
parser rather than the lexical scanner, so are described in "SYNTACTIC
GRAMMAR" below.
The string lexer recognizes "lbrace", "rbrace",
"lparen", and "rparen" tokens in order to count brace- or
parenthesis-depth. This is necessary so it knows when to accept a string
delimited by braces or parentheses. (Note that a parenthesis-delimited string
is only allowed after @comment---this is not a normal BibTeX construct.) In
addition, it converts each non-space whitespace character (newline,
carriage-return, and tab) to a single space. (Sequences of whitespace are not
collapsed; that's the domain of string post-processing, which is well removed
from the scanner or parser.) Finally, it accepts """ to delimit
quote-delimited strings. Apart from those restrictions, the string lexer
accepts anything up to the end-of-string delimiter.
SYNTACTIC GRAMMAR¶
(The language used to describe the grammar here is the extended Backus-Naur Form
(EBNF) used by PCCTS. Terminals are represented by uppercase strings,
non-terminals by lowercase strings; terminal names are the same as those given
in the lexical grammar above. "( foo )*" means zero or more
repetitions of the "foo" production, and "{ foo }" means
an optional "foo".)
A file is just a sequence of zero or more entries:
bibfile : ( entry )*
An entry is an at-sign, a name (the "entry type"), and the entry body:
entry : AT NAME body
A body is either a string (this alternative is only tried if the entry type is
"comment") or the entry contents:
body : STRING # for comment entries
| ENTRY_OPEN contents ENTRY_CLOSE
("ENTRY_OPEN" and "ENTRY_CLOSE" are either "{" and
"}" or "(" and ")", depending what is seen in
the input for a particular entry.)
There are three possible productions for the "contents" non-terminal.
Only one applies to any given entry, depending on the entry metatype (which in
turn depends on the entry type). Currently,
btparse supports four entry
metatypes: comment, preamble, macro definition, and regular. The first two
correspond to @comment and @preamble entries; "macro definition" is
for @string entries; and "regular" is for all other entry types.
(The library will be extended to handle @modify and @alias entry types, and
corresponding "modify" and "alias" metatypes, when BibTeX
1.0 is released and the exact syntax is known.) The "metatype"
concept is necessary so that all entry types that aren't specifically
recognized fall into the "regular" metatype. It's also convenient
not to have to "strcmp" the entry type all the time.
contents : ( NAME | NUMBER ) COMMA fields # for regular entries
| fields # for macro definition entries
| value # for preamble entries
Note that the entry key is not just a "NAME", but "( NAME |
NUMBER)". This is necessary because BibTeX allows all-numeric entry keys,
but
btparse's lexical scanner recognizes such digit strings as
"NUMBER" tokens.
"fields" is a comma-separated list of fields, with an optional single
trailing comma:
fields : field { COMMA fields }
|
A "field" is a single "field = value" assignment:
field : NAME EQUALS value
Note that "NAME" here is a restricted version of the "name"
token described in "LEXICAL GRAMMAR" above. Any "name"
token will be accepted by the parser, but it is immediately checked to ensure
that it doesn't begin with a digit; if so, an artificial syntax error is
triggered. (This is for compatibility with BibTeX, which doesn't allow field
names to start with a digit.)
A "value" is a series of simple values joined by '#' characters:
value : simple_value ( HASH simple_value )*
A simple value is a string, number, or name (for macro invocations):
simple_value : STRING
| NUMBER
| NAME
SEE ALSO¶
btparse
AUTHOR¶
Greg Ward <gward@python.net>