NAME¶
bt_post_processing - post-processing of BibTeX strings, values, and entries
SYNOPSIS¶
void bt_postprocess_string (char * s,
btshort options)
char * bt_postprocess_value (AST * value,
btshort options,
boolean replace);
char * bt_postprocess_field (AST * field,
btshort options,
boolean replace);
void bt_postprocess_entry (AST * entry,
btshort options);
DESCRIPTION¶
When
btparse parses a BibTeX entry, it initially stores the results in an
abstract syntax tree (AST), in a form exactly mirroring the parsed data. For
example, the entry
@Article{Jones:1997a,
AuThOr = "Bob Jones" # and # "Jim Smith ",
TITLE = "Feeding Habits of
the Common Cockroach",
JoUrNaL = j_ent,
YEAR = 1997
}
would parse to an AST that could be represented as follows:
(entry,"Article")
(key,"Jones:1997a")
(field,"AuThOr")
(string,"Bob Jones")
(macro,"and")
(string,"Jim Smith ")
(field,"TITLE")
(string,"Feeding Habits of the Common Cockroach")
(field,"JoUrNaL")
(macro,"j_ent")
(field,"YEAR")
(number,"1997")
The advantage of this form is that all the important information in the entry is
readily available by traversing the tree using the functions described in
bt_traversal. This obvious problem is that the data is a little too raw to be
immediately useful: entry types and field names are inconsistently
capitalized, strings are full of unwanted whitespace, field values not reduced
to single strings, and so forth.
All of these problems are addressed by
btparse's post-processing
functions, described here. Normally, you won't have to call these
functions---the library does the Right Thing for you after parsing each entry,
and you can customize what exactly the Right Thing is for your application.
(For instance, you can tell it to expand macros, but not to concatenate
substrings together.) However, it's conceivable that you might wish to move
the post-processing into your own code and out of the library's control. More
likely, you could have strings that come from something other than BibTeX
files that you would like to have treated as BibTeX strings; for that
situation, the post-processing functions are essential. Finally, you might
just be curious about what exactly happens to your data after it's parsed. If
so, you've come to the right place for excruciatingly detailed explanations.
FUNCTIONS¶
btparse offers four points of entry to its post-processing code. Of
these, probably only the first and last---for processing individual strings
and whole entries---will be commonly used.
Post-processing entry points¶
To understand why four entry points are offered, an explanation of the sample
AST shown above will help. First of all, the whole entry is represented by the
"(entry,"Article")" node; this node has the entry key and
all its field/value pairs as children. Entry nodes are returned by
"bt_parse_entry()" and "bt_parse_entry_s()" (see bt_input)
as well as "bt_next_entry()" (which traverses a list of entries
returned from "bt_parse_file()"---see bt_traversal). Whole entries
may be post-processed with "bt_postprocess_entry()".
You may also need to post-process a single field, or just the value associated
with it. (The difference is that processing the field can change the field
name---e.g. to lowercase---in addition to the field value.) The
"(field,"AuThOr")" node above is an example of a field
sub-AST, and "(string,"Bob Jones")" is the first node in
the list of simple values representing that field's value. (Recall that a
field value is, in general, a list of simple values.) Field nodes are returned
by "bt_next_field()", value nodes by "bt_next_value()".
The former may be passed to "bt_postprocess_field()" for
post-processing, the latter to "bt_postprocess_value()".
Finally, individual strings may wander into your program from many places other
than a
btparse AST. For that reason,
"bt_postprocess_string()" is available for post-processing arbitrary
strings.
Post-processing options¶
All of the post-processing routines have an "options" parameter, which
you can use to fine-tune the post-processing. (This is just like the
per-metatype string-processing options that you can set before parsing
entries; see "bt_set_stringopts()" in bt_input.) Like elsewhere in
the library, "options" is a bitmap constructed by or'ing together
various predefined constants. These constants and their effects are documented
in "String processing option macros" in btparse.
- bt_postprocess_string ()
-
void bt_postprocess_string (char * s,
btshort options)
Post-processes an individual string, "s", which is modified in
place. The only post-processing option that makes sense on individual
strings is whether to collapse whitespace according to the BibTeX rules;
thus, if "options & BTO_COLLAPSE" is false, this function
has no effect. (Although it makes a complete pass over the string anyways.
This is for future expansion.)
The exact rules for collapsing whitespace are simple: non-space whitespace
characters (tabs and newlines mainly) are converted to space, any strings
of more than one space within are collapsed to a single space, and any
leading or trailing spaces are deleted. (Ensuring that all whitespace is
spaces is actually done by btparse's lexical scanner, so strings in
btparse ASTs will never have whitespace apart from space. Likewise,
any strings passed to bt_postprocess_string() should not contain
non-space whitespace characters.)
- bt_postprocess_value ()
-
char * bt_postprocess_value (AST * value,
btshort options,
boolean replace);
Post-processes a single field value, which is the head of a list of simple
values as returned by "bt_next_value()". All of the relevant
string-processing options come into play here: conversion of numbers to
strings ("BTO_CONVERT"), macro expansion
("BTO_EXPAND"), collapsing of whitespace
("BTO_COLLAPSE"), and string pasting ("BTO_PASTE").
Since pasting substrings together without first expanding macros and
converting numbers would be nonsensical, attempting to do so is a fatal
error.
If "replace" is true, then the list headed by "value"
will be replaced by a list representing the processed value. That is, if
string pasting is turned on ("options & BTO_PASTE" is true),
then this list will be collapsed to a single node containing the single
string that results from pasting together all the substrings. If string
pasting is not on, then each node in the list will be left intact, but
will have its text replaced by processed text.
If "replace" is false, then a new string will be built on the fly
and returned by the function. Note that if pasting is not on in this case,
you will only get the last string in the list. (It doesn't really make a
lot of sense to post-process a value without pasting unless you're
replacing it with the new value, though.)
Returns the string that resulted from processing the whole value, which only
makes sense if pasting was on or there was only one value in the list. If
a multiple-value list was processed without pasting, the last string in
the list is returned (after processing).
Consider what might be done to the value of the "author" field in
the above example, which is the concatenation of a string, a macro, and
another string. Assume that the macro "and" expands to "
and ", and that the variable "value" points to the sub-AST
for this value. The original sub-AST corresponding to this value is
(string,"Bob Jones")
(macro,"and")
(string,"Jim Smith ")
To fully process this value in-place, you would call
bt_postprocess_value (value, BTO_FULL, TRUE);
("BTO_FULL" is just the combination of all possible
string-processing options:
"BTO_CONVERT|BTO_EXPAND|BTO_PASTE|BTO_COLLAPSE".) This would
convert the value to a single-element list,
(string,"Bob Jones and Jim Smith")
and return the fully-processed string "Bob Jones and Jim Smith".
Note that the "and" macro has been expanded, interpolated
between the two literal strings, everything pasted together, and finally
whitespace collapsed. (Collapsing whitespace before concatenating the
strings would be a bad idea.)
Let's say you'd rather preserve the list nature of the value, while
expanding macros and converting any numbers to strings. (This conversion
is trivial: it just changes the type of the node from
"BTAST_NUMBER" to "BTAST_STRING". "Number"
values are always stored as a string of digits, just as they appear in the
file.) This would be done with the call
bt_postprocess_value
(value, BTO_CONVERT|BTO_EXPAND|BTO_COLLAPSE,TRUE);
which would change the list to
(string,"Bob Jones")
(string,"and")
(string,"Jim Smith")
Note that whitespace is collapsed here before any concatenation can
be done; this is probably a bad idea. But you can do it if you wish. (If
you get any ideas about cooking up your own value post-processing scheme
by doing it in little steps like this, take a look at the source to
"bt_postprocess_value()"; it should dissuade you from such a
venture.)
- bt_postprocess_field ()
-
char * bt_postprocess_field (AST * field,
btshort options,
boolean replace);
This is little more than a front-end to "bt_postprocess_value()";
the only difference is that you pass it a "field" AST node (eg.
the "(field,"AuThOr")" in the above example), and that
it transforms the field name in addition to its value. In particular, the
field name is forced to lowercase; this behaviour is (currently) not
optional.
Returns the string returned by "bt_postprocess_value()".
- bt_postprocess_entry ()
-
void bt_postprocess_entry (AST * entry,
btshort options);
Post-processes all values in an entry. If "entry" points to the
AST for a "regular" or "macro definition" entry, then
the values are just what you'd expect: everything on the right-hand side
of a field or macro "assignment." You can also post-process
comment and preamble entries, though. Comment entries are essentially one
big string, so only whitespace collapsing makes sense on them. Preambles
may have multiple strings pasted together, so all the string-processing
options apply to them. (And there's nothing to prevent you from using
macros in a preamble.)