.\" Automatically generated by Pod::Man 4.09 (Pod::Simple 3.35)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.if !\nF .nr F 0
.if \nF>0 \{\
.    de IX
.    tm Index:\\$1\t\\n%\t"\\$2"
..
.    if !\nF==2 \{\
.        nr % 0
.        nr F 2
.    \}
.\}
.\" ========================================================================
.\"
.IX Title "Tagger 3pm"
.TH Tagger 3pm "2018-09-30" "perl v5.26.2" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.
.\" Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
Lingua::EN::Tagger \- Part\-of\-speech tagger for English natural language processing.
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 2
\&    use Lingua::EN::Tagger;
\&
\&    # Create a parser object
\&    my $p = Lingua::EN::Tagger\->new;
\&
\&    # Add part of speech tags to a text
\&    my $tagged_text = $p\->add_tags($text);
\&
\&    ...
\&
\&    # Get a list of all nouns and noun phrases with occurrence counts
\&    my %word_list = $p\->get_words($text);
\&
\&    ...
\&
\&    # Get a readable version of the tagged text
\&    my $readable_text = $p\->get_readable($text);
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
The module is a probability-based, corpus-trained tagger that assigns \s-1POS\s0 tags to
English text based on a lookup dictionary and a set of probability values.
The tagger assigns tags based on conditional probabilities: it examines the
preceding tag to determine the appropriate tag for the current word.
Unknown words are classified according to word morphology, or can be set to
be treated as nouns or other parts of speech.
.PP
The tagger also extracts as many nouns and noun phrases as it can, using a
set of regular expressions.
.SH "CONSTRUCTOR"
.IX Header "CONSTRUCTOR"
.ie n .IP "new %PARAMS" 4
.el .IP "new \f(CW%PARAMS\fR" 4
.IX Item "new %PARAMS"
Class constructor.  Takes a hash with the following parameters (shown with
default values):
.RS 4
.IP "unknown_word_tag => ''" 4
.IX Item "unknown_word_tag => ''"
Tag to assign to unknown words.
.IP "stem => 0" 4
.IX Item "stem => 0"
Stem single words using Lingua::Stem::EN.
.IP "weight_noun_phrases => 0" 4
.IX Item "weight_noun_phrases => 0"
When returning occurrence counts for a noun phrase, multiply the value
by the number of words in the \s-1NP.\s0
.IP "longest_noun_phrase => 5" 4
.IX Item "longest_noun_phrase => 5"
Noun phrases longer than this threshold are ignored.
This affects only the \fIget_words()\fR and \fIget_nouns()\fR methods.
.IP "relax => 0" 4
.IX Item "relax => 0"
Relax the Hidden Markov Model: this may improve accuracy for uncommon words,
particularly words used polysemously.
.RE
.RS 4
.RE
.SH "METHODS"
.IX Header "METHODS"
.IP "add_tags \s-1TEXT\s0" 4
.IX Item "add_tags TEXT"
Examine the string provided and return it fully tagged (\s-1XML\s0 style).
.IP "add_tags_incrementally \s-1TEXT\s0" 4
.IX Item "add_tags_incrementally TEXT"
Examine the string provided and return it fully tagged (\s-1XML\s0 style), but
do not reset the internal part-of-speech state between invocations.
.IP "get_words \s-1TEXT\s0" 4
.IX Item "get_words TEXT"
Given a text string, return as many nouns and noun phrases as possible.
Applies add_tags and involves three stages:
.RS 4
.Sp
.Vb 3
\&    * Tag the text
\&    * Extract all the maximal noun phrases (MNPs)
\&    * Recursively extract all noun phrases from the MNPs
.Ve
.RE
.RS 4
.RE
.IP "get_readable \s-1TEXT\s0" 4
.IX Item "get_readable TEXT"
Return an easy-on-the-eyes tagged version of a text string.  Applies
add_tags and reformats the result to be easier to read.
.IP "get_sentences \s-1TEXT\s0" 4
.IX Item "get_sentences TEXT"
Returns an anonymous array of sentences (without \s-1POS\s0 tags) from a text.
.IP "get_proper_nouns \s-1TAGGED_TEXT\s0" 4
.IX Item "get_proper_nouns TAGGED_TEXT"
Given a POS-tagged text, this method returns a hash of all proper nouns
and their occurrence frequencies.  The method is greedy and will
return multi-word phrases, if possible, so it would find ``Linguistic
Data Consortium'' as a single unit, rather than as three individual
proper nouns.  This method does not stem the found words.
.IP "get_nouns \s-1TAGGED_TEXT\s0" 4
.IX Item "get_nouns TAGGED_TEXT"
Given a POS-tagged text, this method returns all nouns and their
occurrence frequencies.
.IP "get_max_noun_phrases \s-1TAGGED_TEXT\s0" 4
.IX Item "get_max_noun_phrases TAGGED_TEXT"
Given a POS-tagged text, this method returns only the maximal noun phrases.
May be called directly, but is also used by get_noun_phrases.
.IP "get_noun_phrases \s-1TAGGED_TEXT\s0" 4
.IX Item "get_noun_phrases TAGGED_TEXT"
Similar to get_words, but requires a POS-tagged text as an argument.
.IP "install" 4
.IX Item "install"
Reads some included corpus data and saves it in a stored hash on the
local file system.  This is called automatically if the tagger can't
find the stored lexicon.
.SH "AUTHORS"
.IX Header "AUTHORS"
.Vb 1
\&    Aaron Coburn
.Ve
.SH "CONTRIBUTORS"
.IX Header "CONTRIBUTORS"
.Vb 2
\&    Maciej Ceglowski
\&    Eric Nichols, Nara Institute of Science and Technology
.Ve
.SH "COPYRIGHT AND LICENSE"
.IX Header "COPYRIGHT AND LICENSE"
.Vb 1
\&    Copyright 2003\-2010 Aaron Coburn
\&
\&    This program is free software; you can redistribute it and/or modify
\&    it under the terms of version 3 of the GNU General Public License as
\&    published by the Free Software Foundation.
.Ve
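.SH "EXAMPLES"
.IX Header "EXAMPLES"
The following is a minimal sketch of how the methods documented above might
be combined; it is not taken from the module's distribution.  It assumes the
module is installed and that the constructor options behave as listed under
\s-1CONSTRUCTOR.\s0  The sample text, the option values and the variable names are
illustrative only.  No output is shown, because the tags and counts depend on
the installed lexicon.
.PP
.Vb 4
\&    use strict;
\&    use warnings;
\&    use Lingua::EN::Tagger;
\&
\&    # Illustrative input; any English text will do.
\&    my $text = "The Linguistic Data Consortium distributes annotated corpora.";
\&
\&    # Options as documented under CONSTRUCTOR; the values are arbitrary.
\&    my $p = Lingua::EN::Tagger\->new(
\&        longest_noun_phrase => 3,
\&        weight_noun_phrases => 1,
\&    );
\&
\&    # Tag once, then reuse the tagged text for a TAGGED_TEXT method.
\&    my $tagged = $p\->add_tags($text);
\&    my %proper = $p\->get_proper_nouns($tagged);
\&
\&    # get_words applies add_tags itself, so it takes the raw string.
\&    my %words = $p\->get_words($text);
\&
\&    # get_sentences is documented as returning an anonymous array.
\&    my $sentences = $p\->get_sentences($text);
\&
\&    for my $phrase (sort keys %words) {
\&        print "$phrase: $words{$phrase}\en";
\&    }
\&    print scalar(@$sentences), " sentence(s)\en";
.Ve
.PP
Tagging once with add_tags and passing the result to the \s-1TAGGED_TEXT\s0
methods avoids tagging the same text again for each extraction.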