NAME
Text::BibTeX::Name - interface to BibTeX-style author names
SYNOPSIS
$name = new Text::BibTeX::Name;
$name->split('J. Random Hacker');
# or:
$name = new Text::BibTeX::Name ('J. Random Hacker');
@firstname_tokens = $name->part ('first');
$lastname = join (' ', $name->part ('last'));
$format = new Text::BibTeX::NameFormat;
# ...customize $format...
$formatted = $name->format ($format);
DESCRIPTION
"Text::BibTeX::Name" provides an abstraction for BibTeX-style names
and some basic operations on them. A name, in the BibTeX world, consists of a
list of tokens which are divided amongst four parts: `first', `von', `last',
and `jr'.
Tokens are separated by whitespace or commas at brace-level zero. Thus the name
van der Graaf, Horace Q.
has five tokens, whereas the name
{Foo, Bar, and Sons}
consists of a single token. Skip down to "EXAMPLES" for more examples,
or read on if you want to know the exact details of how names are split into
tokens and parts.
How tokens are divided into parts depends on the form of the name. If the name
has no commas at brace-level zero (as in the second example), then it is
assumed to be in either "first last" or "first von last"
form. If there are no tokens that start with a lower-case letter, then
"first last" form is assumed: the final token is the last name, and
all other tokens form the first name. Otherwise, the earliest contiguous
sequence of tokens with initial lower-case letters is taken as the `von' part;
if this sequence includes the final token, then a warning is printed and the
final token is forced to be the `last' part.
If a name has a single comma, then it is assumed to be in "von last,
first" form. A leading sequence of tokens with initial lower-case
letters, if any, forms the `von' part; tokens between the `von' and the comma
form the `last' part; tokens following the comma form the `first' part. Again,
if there are no tokens following a leading sequence of lowercase tokens, a
warning is printed and the token immediately preceding the comma is taken to
be the `last' part.
If a name has two commas, it is assumed to be in "von last, jr,
first" form. (This is the only way to represent a name with a `jr' part.)
The parsing of the name is the same as for a one-comma name, except that
tokens between the two commas are taken to be the `jr' part.
Finally, if a name has more than two commas, a warning is printed and the name
is treated as though only the first two commas were present.
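The splitting rules above can be sketched in plain Perl. This is only an
illustration under stated assumptions, not the library's C implementation: the
helper names comma_groups, starts_lower, von_last, and split_parts are
hypothetical, no warnings are emitted, only an ASCII lower-case first character
counts as lower-case, and tokens after a third or later comma are simply added
to the first name.

```perl
use strict;
use warnings;

# Split a name into groups of tokens, one group per comma at brace-level zero.
# Within a group, tokens are separated by spaces at brace-level zero.
sub comma_groups {
    my ($name) = @_;
    my @groups = ([]);
    my ($depth, $token) = (0, '');
    for my $ch (split //, $name) {
        $depth++ if $ch eq '{';
        $depth-- if $ch eq '}';
        if ($depth == 0 && ($ch eq ' ' || $ch eq ',')) {
            push @{ $groups[-1] }, $token if length $token;
            $token = '';
            push @groups, [] if $ch eq ',';
        }
        else {
            $token .= $ch;
        }
    }
    push @{ $groups[-1] }, $token if length $token;
    return @groups;
}

# Simplification: only an ASCII lower-case first character counts.
sub starts_lower { return $_[0] =~ /^[a-z]/ }

# Return three token lists: leading tokens (only scanned for when
# $scan_first is true), the first run of lower-case tokens (von), and the
# rest (last). The final token is never consumed by the first two lists,
# so it always ends up in the last part.
sub von_last {
    my ($scan_first, @tok) = @_;
    my $i = 0;
    if ($scan_first) { $i++ while $i < $#tok && !starts_lower($tok[$i]) }
    my $j = $i;
    $j++ while $j < $#tok && starts_lower($tok[$j]);
    return ([ @tok[0 .. $i - 1] ], [ @tok[$i .. $j - 1] ], [ @tok[$j .. $#tok] ]);
}

sub split_parts {
    my ($name) = @_;
    my @groups = comma_groups($name);
    my %part = (first => [], von => [], last => [], jr => []);
    if (@groups == 1) {                       # "first von last"
        @part{qw(first von last)} = von_last(1, @{ $groups[0] });
    }
    else {                                    # "von last, first" or "von last, jr, first"
        (undef, @part{qw(von last)}) = von_last(0, @{ $groups[0] });
        if (@groups == 2) {
            $part{first} = $groups[1];
        }
        else {
            $part{jr}    = $groups[1];
            # Assumption: groups after a third comma all join the first name.
            $part{first} = [ map { @$_ } @groups[2 .. $#groups] ];
        }
    }
    return %part;
}

for my $example ('Ludwig van Beethoven', 'van Beethoven, Ludwig', 'Doe, Jr., John') {
    my %part = split_parts($example);
    print "$example\n";
    print "  $_ => (@{ $part{$_} })\n" for qw(first von last jr);
}
```

In real code, use the "split" and "part" methods instead; the sketch only makes
the order of the decisions explicit.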
CAVEAT
The C code that does the actual work of splitting up names takes a shortcut and
makes a few assumptions about whitespace. In particular, there must be no
leading whitespace, no trailing whitespace, no consecutive whitespace
characters in the string, and no whitespace characters other than space. In
other words, all whitespace must consist of lone internal spaces.
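One way to meet these requirements is to normalize a string before handing it
to "split". The following is a minimal sketch; the helper name
normalize_name_whitespace is hypothetical and not part of Text::BibTeX. Note
that it also collapses whitespace inside braces, which is assumed to be
acceptable here.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce all whitespace to lone internal spaces.
sub normalize_name_whitespace {
    my ($name) = @_;
    $name =~ s/\s+/ /g;     # each run of whitespace (tabs, newlines, ...) becomes one space
    $name =~ s/^ | $//g;    # drop the single leading and trailing space, if present
    return $name;
}

print normalize_name_whitespace("  Smith,\tJohn \n"), "\n";   # prints "Smith, John"
```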
EXAMPLES
The strings "John Smith" and "Smith, John" are different
representations of the same name, so split into parts and tokens the same way,
namely as:
first => ('John')
von => ()
last => ('Smith')
jr => ()
Note that every part is a list of tokens, even if there is only one token in
that part; empty parts get empty token lists. Every token is just a string.
Writing this example in actual code is simple:
$name = new Text::BibTeX::Name ("John Smith"); # or "Smith, John"
$name->part ('first'); # returns list ("John")
$name->part ('last'); # returns list ("Smith")
$name->part ('von'); # returns list ()
$name->part ('jr'); # returns list ()
(We'll omit the empty parts in the rest of the examples: just assume that any
unmentioned part is an empty list.) If there are more than two tokens and
no comma, the extra tokens go to the first name: thus "John Q. Smith"
splits into
first => ("John", "Q.")
last => ("Smith")
and "J. R. R. Tolkien" into
first => ("J.", "R.", "R.")
last => ("Tolkien")
The ambiguous name "Kevin Philips Bong" splits into
first => ("Kevin", "Philips")
last => ("Bong")
which may or may not be the right thing, depending on the particular person.
There's no way to know though, so if this fellow's last name is "Philips
Bong" and not "Bong", the string representation of his name
must disambiguate. One possibility is "Philips Bong, Kevin" which
splits into
first => ("Kevin")
last => ("Philips", "Bong")
Alternately, "Kevin {Philips Bong}" takes advantage of the fact that
tokens are only split on whitespace at brace-level zero, and becomes
first => ("Kevin")
last => ("{Philips Bong}")
which is fine if your names are destined to be processed by TeX, but might be
problematic in other contexts. Similarly, "St John-Mollusc, Oliver"
becomes
first => ("Oliver")
last => ("St", "John-Mollusc")
which can also be written as "Oliver {St John-Mollusc}":
first => ("Oliver")
last => ("{St John-Mollusc}")
Since tokens are separated purely by whitespace, hyphenated names will work
either way: both "Nigel Incubator-Jones" and "Incubator-Jones,
Nigel" come out as
first => ("Nigel")
last => ("Incubator-Jones")
Multi-token last names with lowercase components -- the "von part" --
work fine: both "Ludwig van Beethoven" and "van Beethoven,
Ludwig" parse (correctly) into
first => ("Ludwig")
von => ("van")
last => ("Beethoven")
This allows these European aristocratic names to sort properly, i.e.
van Beethoven under B rather than v. Speaking of aristocratic
European names, "Charles Louis Xavier Joseph de la Vall{\'e}e
Poussin" is handled just fine, and splits into
first => ("Charles", "Louis", "Xavier", "Joseph")
von => ("de", "la")
last => ("Vall{\'e}e", "Poussin")
so could be sorted under V rather than d. (Note that the sorting
algorithm in Text::BibTeX::BibSort is a slavish imitation of BibTeX 0.99, and
therefore does the wrong thing with these names: the sort key starts with the
"von" part.)
However, capitalized "von parts" don't work so well: "R. J. Van
de Graaff" splits into
first => ("R.", "J.", "Van")
von => ("de")
last => ("Graaff")
which is clearly wrong. This name should be represented as "Van de Graaff,
R. J."
first => ("R.", "J.")
last => ("Van", "de", "Graaff")
which is probably right. (This particular Van de Graaff was an American, so he
probably belongs under V -- which is where my (British) dictionary puts
him. Other Van de Graaff's mileages may vary.)
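Sorting by the `last' part rather than the `von' part, as discussed above, can
be sketched as follows. This is an illustration only and is not how
Text::BibTeX::BibSort builds its keys; the helper name sort_key_without_von is
hypothetical, and the hand-written hashes stand in for the token lists that
"part" would return.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lower-cased sort key from the last, jr, and
# first parts, skipping the von part entirely.
sub sort_key_without_von {
    my (%part) = @_;
    my @tokens = map { @{ $part{$_} || [] } } qw(last jr first);
    return lc(join ' ', @tokens);
}

my @names = (
    { first => ['Johannes'],            last => ['Brahms'] },
    { first => ['Ludwig'],              von  => ['van'], last => ['Beethoven'] },
    { first => ['Johann', 'Sebastian'], last => ['Bach'] },
);

my @sorted = sort { sort_key_without_von(%$a) cmp sort_key_without_von(%$b) } @names;
print join(', ', map { "@{ $_->{last} }" } @sorted), "\n";   # prints "Bach, Beethoven, Brahms"
```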
Finally, many names include a suffix: "Jr.", "III",
"fils", and so forth. These are handled, but with some limitations.
If there's a comma before the suffix (the usual U.S. convention for
"Jr."), then the name should be in "last, jr, first" form, e.g.
"Doe, Jr., John" comes out (correctly) as
first => ("John")
last => ("Doe")
jr => ("Jr.")
but "John Doe, Jr." is ambiguous and is parsed as
first => ("Jr.")
last => ("John", "Doe")
(so don't do it that way). If there's no comma before the suffix -- the usual
convention for Roman numerals, and occasionally seen with "Jr." -- then you're
stuck and have to make the suffix part of the last name. Thus, "Gates
III, William H." comes out
first => ("William", "H.")
last => ("Gates", "III")
but "William H. Gates III" is ambiguous, and becomes
first => ("William", "H.", "Gates")
last => ("III")
-- not what you want. Again, the curly-brace trick comes in handy, so
"William H. {Gates III}" splits into
first => ("William", "H.")
last => ("{Gates III}")
There is no way to make a comma-less suffix the "jr" part. (This is an
unfortunate consequence of slavishly imitating BibTeX 0.99.)
Finally, names that aren't really names of people but rather are organization or
company names should be forced into a single token by wrapping them in curly
braces. For example, "Foo, Bar and Sons" should be written
"{Foo, Bar and Sons}", which will split as
last => ("{Foo, Bar and Sons}")
Of course, if this is one name in a BibTeX "authors" or
"editors" list, this name has to be wrapped in braces anyway
(because of the " and "), but that's another story.
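As a small illustration of the bracing advice, the sketch below builds a
multi-name field value in which organization names are protected by braces.
The helper name author_field and the [NAME, IS_ORGANIZATION] entry layout are
hypothetical and not part of Text::BibTeX.

```perl
use strict;
use warnings;

# Hypothetical helper: join names with " and ", wrapping organization names
# in braces so that they stay a single token (and a single name).
sub author_field {
    my @entries = @_;    # each entry is [ NAME, IS_ORGANIZATION ]
    return join ' and ', map {
        my ($name, $is_org) = @$_;
        $is_org ? '{' . $name . '}' : $name;
    } @entries;
}

print author_field(['Smith, John', 0], ['Foo, Bar and Sons', 1]), "\n";
# prints "Smith, John and {Foo, Bar and Sons}"
```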
Putting a split-up name back together again in a flexible, customizable way is
the job of another module: see Text::BibTeX::NameFormat.
METHODS
- new (CLASS [, NAME [, FILENAME, LINE, NAME_NUM]])
- Creates a new "Text::BibTeX::Name" object. If
NAME is supplied, it must be a string containing a single name, and it
will be passed to the "split" method for further processing.
FILENAME, LINE, and NAME_NUM, if present, are all also passed to
"split" to allow better error messages.
- split (NAME [, FILENAME, LINE, NAME_NUM])
- Splits NAME (a string containing a single name) into tokens
and subsequently into the four parts of a BibTeX-style name (first, von,
last, and jr). (Each part is a list of tokens, and tokens are separated by
whitespace or commas at brace-depth zero. See above for full details on
how a name is split into its component parts.)
The token-lists that make up each part of the name are then stored in the
"Text::BibTeX::Name" object for later retrieval or formatting
with the "part" and "format" methods.
- part (PARTNAME)
- Returns the list of tokens in part PARTNAME of a name
previously split with "split". For example, suppose a
"Text::BibTeX::Name" object is created and initialized like
this:
$name = new Text::BibTeX::Name;
# q{...} keeps the backslash; inside '...' the sequence \' collapses to a bare '
$name->split (q{Charles Louis Xavier Joseph de la Vall{\'e}e Poussin});
Then this code:
$name->part ('von');
would return the list "('de','la')".
- format (FORMAT)
- Formats a name according to the specifications encoded in
FORMAT, which should be a "Text::BibTeX::NameFormat" (or
descendant) object. (In short, it must supply a method "apply"
which takes a "Text::BibTeX::Name" object as its only
argument.) Returns the formatted name as a string.
See Text::BibTeX::NameFormat for full details on formatting names.
SEE ALSO
Text::BibTeX::Entry, Text::BibTeX::NameFormat, bt_split_names.
AUTHOR
Greg Ward <gward@python.net>
COPYRIGHT
Copyright (c) 1997-2000 by Gregory P. Ward. All rights reserved. This file is
part of the Text::BibTeX library. This library is free software; you may
redistribute it and/or modify it under the same terms as Perl itself.