.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.43)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "Text::Names 3pm"
.TH Text::Names 3pm "2023-02-04" "perl v5.36.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.
.\" Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
Text::Names \- Perl extension for proper name parsing, normalization, recognition, and classification
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\&    use Text::Names qw/parseNames samePerson guessGender/;
\&
\&    my @authors = parseNames("D Bourget, Zbigniew Z Lukasiak and John Doe");
\&
\&    # @authors = (\*(AqBourget, D.\*(Aq,\*(AqLukasiak, Zbigniew Z.\*(Aq,\*(AqDoe, John\*(Aq)
\&
\&    print "same!" if samePerson("Dave Bourget","David F. Bourget");
\&
\&    # same!
\&
\&    print guessGender("David");
\&
\&    # "M"
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
This module provides a number of name normalization routines, plus high-level
parsing and name comparison utilities such as those illustrated in the
synopsis.
.PP
While it tries to accommodate non-Western names, this module works better
with Western names, especially English-style names.
.PP
No subroutine is exported by default.
.PP
This module normalizes names to this format:
.PP
Lastname(s) [Jr], Given name(s)
.PP
Some examples:
.PP
1) Bourget, David Joseph Richard
.PP
2) Bourget Jr, David
.PP
3) Bourget, D. J. R.
.PP
These are all normalized names. This format is referred to here as the
normalized representation of a name.
.SH "SUBROUTINES"
.IX Header "SUBROUTINES"
.SS "abbreviationOf(string name1,string name2): boolean"
.IX Subsection "abbreviationOf(string name1,string name2): boolean"
Returns true iff name1 is a common abbreviation of name2 in English. For
example, 'Dave' is a common abbreviation of 'David'. The list of
abbreviations used includes a number of old abbreviations such as 'Davy' for
\&'David'.
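.PP
A minimal usage sketch, assuming Text::Names is installed. The call uses the
fully qualified subroutine name so that it does not rely on the module's
export list, which is not spelled out here:
.PP
.Vb 7
\&    use strict;
\&    use warnings;
\&    use Text::Names;
\&
\&    # \*(AqDave\*(Aq is given above as a common abbreviation of \*(AqDavid\*(Aq.
\&    print "Dave abbreviates David\en"
\&        if Text::Names::abbreviationOf(\*(AqDave\*(Aq, \*(AqDavid\*(Aq);
.Ve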
.SS "cleanName(string name): string"
.IX Subsection "cleanName(string name): string"
Like parseName, but a) returns the normalized form of the name instead of an
array, and b) does additional cleaning-up. Prefer it to parseName in most
cases, and by default when processing variable or dubious data.
.SS "composeName(string given, string last): string"
.IX Subsection "composeName(string given, string last): string"
Returns the name in the \*(L"last, given\*(R" format.
.SS "isCommonFirstname(string name, [float threshold]): boolean"
.IX Subsection "isCommonFirstname(string name, [float threshold]): boolean"
Returns true if the name is among the 1000 most popular first names (male or
female) according to the 1990 \s-1US\s0 Census. If a threshold percentage is
passed, the name must have at least this frequency for the subroutine to
return true. See
http://www.census.gov/genealogy/www/data/1990surnames/names_files.html.
.SS "isCommonSurname(string name, [float threshold]): boolean"
.IX Subsection "isCommonSurname(string name, [float threshold]): boolean"
Returns true if the name is among the 1000 most popular surnames according to
the 1990 \s-1US\s0 Census. If a threshold percentage is passed, the name must
have at least this frequency for the subroutine to return true. See
http://www.census.gov/genealogy/www/data/1990surnames/names_files.html.
.SS "firstnamePrevalence(string name): float [0\-100]"
.IX Subsection "firstnamePrevalence(string name): float [0-100]"
Returns a float between 0 and 100 indicating how common the first name is
according to the 1990 \s-1US\s0 Census. Names that are not in the top 1000
return 0.
.SS "surnamePrevalence(string name): float [0\-100]"
.IX Subsection "surnamePrevalence(string name): float [0-100]"
Returns a float between 0 and 100 indicating how common the surname is
according to the 1990 \s-1US\s0 Census. Names that are not in the top 1000
return 0.
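.PP
The census-based routines above can be combined as in the following sketch.
It assumes Text::Names is installed and calls the subroutines by their fully
qualified names. The surname 'Smith' and the 0.5 threshold are arbitrary
illustrative choices, and no particular output is implied:
.PP
.Vb 12
\&    use strict;
\&    use warnings;
\&    use Text::Names;
\&
\&    my $name = \*(AqSmith\*(Aq;
\&
\&    # Percentage frequency; 0 means the name is outside the top 1000.
\&    my $pct = Text::Names::surnamePrevalence($name);
\&
\&    # With a threshold, the name must reach that frequency (in percent).
\&    my $common = Text::Names::isCommonSurname($name, 0.5);
\&
\&    printf "%s: prevalence %.3f%%, common at 0.5%%: %s\en",
\&        $name, $pct, $common ? \*(Aqyes\*(Aq : \*(Aqno\*(Aq;
.Ve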
.SS "normalizeNameWhitespace(string name): string"
.IX Subsection "normalizeNameWhitespace(string name): string"
Normalizes the whitespace within a name. This is mainly for internal use.
.SS "parseName(string name): array"
.IX Subsection "parseName(string name): array"
Takes a name in one of the multiple formats that one can write a name in, and
returns it as an array representing the post-comma and pre-comma parts of its
normalized form (in that order). For example, parseName(\*(L"David Bourget\*(R")
returns ('David','Bourget').
.SS "parseName2(string name): array"
.IX Subsection "parseName2(string name): array"
Use on already-normalized names to split them into four parts: full given
names, initials, last names, and suffix. The only 'suffix' recognized is 'Jr'.
.SS "parseNameList(array names): array"
.IX Subsection "parseNameList(array names): array"
Takes an array of names (as strings) and returns an array of normalized
representations of the names in the array.
.SS "parseNames(string names): array"
.IX Subsection "parseNames(string names): array"
Takes a string of names as parameter and returns an array of normalized
representations of the names in the string. This routine understands a wide
variety of formattings for names and lists of names typically found as lists
of authors in bibliographic citations. See the test 03\-parseNames.t for
multiple examples.
.SS "reverseName(string name): string"
.IX Subsection "reverseName(string name): string"
Given a normalized name of the form \*(L"last, given\*(R", returns \*(L"given last\*(R".
.SS "samePerson(string name1, string name2): string"
.IX Subsection "samePerson(string name1, string name2): string"
Returns a true value iff name1 and name2 could reasonably be two writings of
the same name. For example, 'D J Bourget' could reasonably be a writing of
\&'David Joseph Bourget'. So could 'D Bourget'. But 'D F Bourget' is not a
reasonable writing of 'David Joseph Bourget'.
The value returned is a (potentially new) name string which combines the most
complete tokens of the two submitted name strings.
.PP
Contrary to what one might expect, this subroutine does not use
\&\fBweakenings()\fR behind the scenes. Another way to check for name
compatibility would be to check that two names have a weakening in common
(probably too permissive for most purposes) or that one name is a weakening
of the other.
.SS "setNameAbbreviations(array): undef"
.IX Subsection "setNameAbbreviations(array): undef"
Sets the abbreviation mapping used to determine whether, say, 'David' and
\&'Dave' are compatible name parts. The mapping is also used by
\&\fBabbreviationOf()\fR. The array should be a flat list of abbreviation and
full-name pairs: 'Dave', 'David', 'Davy', 'David', etc., which can also be
written in Perl as 'Dave' => 'David', 'Davy' => 'David', etc.
.SS "getNameAbbreviations"
.IX Subsection "getNameAbbreviations"
Returns the abbreviation mapping.
.SS "weakenings(string first_name, string last_name): array"
.IX Subsection "weakenings(string first_name, string last_name): array"
Returns an array of normalized names which are weakenings of the first and
last name passed as arguments. Substituting an initial for a given name, or
removing an initial, for example, are operations which generate weakenings of
a name. Such operations are applied with arbitrary depth, until the name has
been reduced to a single initial followed by the last name, and all
intermediate steps are returned.
.PP
You can use weakenings(parseName(\*(L"Lastname, Firstname\*(R")) to weaken a name
given as a single string.
.SS "guessGender(string firstname, [float threshold]): string"
.IX Subsection "guessGender(string firstname, [float threshold]): string"
Returns 'F' if someone with the provided first name is likely female, 'M' if
likely male, and undef otherwise.
A frequency threshold (default = 0) can be specified so that a gender is
returned only if the name is found with at least this frequency among people
with this gender (according to the \s-1US\s0 census). A threshold of 0.1
(which means 0.1%) ensures very reliable results (precision above 99%) with a
recall of about 60%. When the threshold is lower, this function has a
tendency to overestimate the number of females.
.SH "EXPORT"
.IX Header "EXPORT"
None by default.
.SH "KNOWN ISSUES"
.IX Header "KNOWN ISSUES"
This module currently overwrites \f(CW@Text::Capitalize::exceptions\fR globally,
which can have unintended side-effects.
.SH "SEE ALSO"
.IX Header "SEE ALSO"
The xPapers application framework from which this has been extracted,
http://www.xpapers.org
.PP
The related Biblio::Citation::Compare module.
.SH "AUTHOR"
.IX Header "AUTHOR"
David Bourget, http://www.dbourget.com, with contributions by Zbigniew Lukasiak
.SH "COPYRIGHT AND LICENSE"
.IX Header "COPYRIGHT AND LICENSE"
Copyright (C) 2011\-2013 by David Bourget and University of London
.PP
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself, either Perl version 5.10.1 or, at your option,
any later version of Perl 5 you may have available.