NAME¶
Lingua::StopWords - Stop words for several languages.
SYNOPSIS¶
use Lingua::StopWords qw( getStopWords );
my $stopwords = getStopWords('en');
my @words = qw( i am the walrus goo goo g'joob );
# prints "walrus goo goo g'joob"
print join ' ', grep { !$stopwords->{$_} } @words;
DESCRIPTION¶
In keyword search, it is common practice to suppress a collection of
"stopwords": words such as "the", "and",
"maybe", etc. which exist in in a large number of documents and do
not tell you anything important about any document which contains them. This
module provides such "stoplists" in several languages.
Supported Languages¶
|-----------------------------------------------------------|
| Language | ISO code | default encoding | also available |
|-----------------------------------------------------------|
| Danish | da | ISO-8859-1 | UTF-8 |
| Dutch | nl | ISO-8859-1 | UTF-8 |
| English | en | ISO-8859-1 | UTF-8 |
| Finnish | fi | ISO-8859-1 | UTF-8 |
| French | fr | ISO-8859-1 | UTF-8 |
| German | de | ISO-8859-1 | UTF-8 |
| Hungarian | hu | ISO-8859-1 | UTF-8 |
| Italian | it | ISO-8859-1 | UTF-8 |
| Norwegian | no | ISO-8859-1 | UTF-8 |
| Portuguese | pt | ISO-8859-1 | UTF-8 |
| Spanish | es | ISO-8859-1 | UTF-8 |
| Swedish | sv | ISO-8859-1 | UTF-8 |
| Russian | ru | KOI8-R | UTF-8 |
|-----------------------------------------------------------|
FUNCTIONS¶
getStopWords¶
my $stoplist = getStopWords('en');
my $utf8_stoplist = getStopWords('en', 'UTF-8');
Retrieve a stoplist in the form of a hashref where the keys are all stopwords
and the values are all 1.
$stoplist = {
and => 1,
if => 1,
# ...
};
getStopWords() expects 1-2 arguments. The first, which is required, is an
ISO code representing a supported language. If the ISO code cannot be found,
getStopWords returns undef.
The second argument should be 'UTF-8' if you want the stopwords encoded in
UTF-8. The UTF-8 flag will be turned on, so make sure you understand all the
implications of that.
SEE ALSO¶
The stoplists supplied by this module were created as part of the Snowball
project (see <
http://snowball.tartarus.org>, Lingua::Stem::Snowball).
Lingua::EN::StopWords provides a different stoplist for English.
AUTHOR¶
Maintained by Marvin Humphrey <marvin at rectangular dot com>. Original
author Fabien Potencier, <fabpot at cpan dot org>.
COPYRIGHT AND LICENSE¶
Copyright 2004-2008 Fabien Potencier, Marvin Humphrey
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself, either Perl version 5.8.3 or, at your option,
any later version of Perl 5 you may have available.