.\" Automatically generated by Pod::Man 4.10 (Pod::Simple 3.35)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "Mail::SpamAssassin::Plugin::TextCat 3pm"
.TH Mail::SpamAssassin::Plugin::TextCat 3pm "2020-01-31" "perl v5.28.1" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
Mail::SpamAssassin::Plugin::TextCat \- TextCat language guesser
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\& loadplugin Mail::SpamAssassin::Plugin::TextCat
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
This plugin will try to guess the language used in the message body text.
.PP
You can use the \*(L"ok_languages\*(R" directive to set which languages are
considered okay for incoming mail; if the guessed language is not okay,
\&\f(CW\*(C`UNWANTED_LANGUAGE_BODY\*(C'\fR is triggered. Alternatively, you can use the
X\-Languages metadata header directly in rules.
.PP
The plugin always adds the results to an \*(L"X\-Languages\*(R" name-value pair in the
message metadata data structure. This may be useful as Bayes tokens and
can also be used in rules for scoring. The results can also be added to
marked-up messages using \*(L"add_header\*(R", with the _LANGUAGES_ tag. See
Mail::SpamAssassin::Conf for details.
.PP
Note: the language cannot always be recognized with sufficient
confidence. In that case, no action is taken.
.PP
You can use the _TEXTCATRESULTS_ tag to view the internal ngram scoring, which
may help when fine-tuning settings.
.PP
Examples of using the X\-Languages header directly in rules:
.PP
.Vb 2
\& header OK_LANGS X\-Languages =~ /\eben\eb/
\& score OK_LANGS \-1
\&
\& header BAD_LANGS X\-Languages =~ /\eb(?:ja|zh)\eb/
\& score BAD_LANGS 1
.Ve
.SH "USER OPTIONS"
.IX Header "USER OPTIONS"
.IP "ok_languages xx [ yy zz ... ] (default: all)" 4
.IX Item "ok_languages xx [ yy zz ... ] (default: all)"
This option is used to specify which languages are considered okay for
incoming mail. SpamAssassin will try to detect the language used in the
message body text.
.Sp
Note that the language cannot always be recognized with sufficient
confidence. In that case, no action is taken.
.Sp
The rule \f(CW\*(C`UNWANTED_LANGUAGE_BODY\*(C'\fR is triggered if none of the languages
detected are in the \*(L"ok\*(R" list. Note that this is the only effect of the
\&\*(L"ok\*(R" list. It does not act as a whitelist against any other form of spam
scanning.
.Sp
In your configuration, you must use the two or three letter language
specifier in lowercase, not the English name for the language. You may
also specify \f(CW\*(C`all\*(C'\fR if a desired language is not listed, or if you want to
allow any language. The default setting is \f(CW\*(C`all\*(C'\fR.
.Sp
Examples:
.Sp
.Vb 3
\& ok_languages all (allow all languages)
\& ok_languages en (only allow English)
\& ok_languages en ja zh (allow English, Japanese, and Chinese)
.Ve
.Sp
Note: if there are multiple ok_languages lines, only the last one is used.
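.Sp
As a sketch of that last point, assuming both lines below end up in the same
configuration, the second line replaces the first rather than extending it,
so only French would be considered okay. The language codes are chosen
purely for illustration:
.Sp
.Vb 2
\& ok_languages en de
\& ok_languages fr
.Ve
.Sp
To allow all three languages, list them on a single line instead:
.Sp
.Vb 1
\& ok_languages en de fr
.Ve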
.Sp
Select the languages to allow from the list below:
.RS 4
.IP "af \- Afrikaans" 4
.IX Item "af - Afrikaans"
.PD 0
.IP "am \- Amharic" 4
.IX Item "am - Amharic"
.IP "ar \- Arabic" 4
.IX Item "ar - Arabic"
.IP "be \- Byelorussian" 4
.IX Item "be - Byelorussian"
.IP "bg \- Bulgarian" 4
.IX Item "bg - Bulgarian"
.IP "bs \- Bosnian" 4
.IX Item "bs - Bosnian"
.IP "ca \- Catalan" 4
.IX Item "ca - Catalan"
.IP "cs \- Czech" 4
.IX Item "cs - Czech"
.IP "cy \- Welsh" 4
.IX Item "cy - Welsh"
.IP "da \- Danish" 4
.IX Item "da - Danish"
.IP "de \- German" 4
.IX Item "de - German"
.IP "el \- Greek" 4
.IX Item "el - Greek"
.IP "en \- English" 4
.IX Item "en - English"
.IP "eo \- Esperanto" 4
.IX Item "eo - Esperanto"
.IP "es \- Spanish" 4
.IX Item "es - Spanish"
.IP "et \- Estonian" 4
.IX Item "et - Estonian"
.IP "eu \- Basque" 4
.IX Item "eu - Basque"
.IP "fa \- Persian" 4
.IX Item "fa - Persian"
.IP "fi \- Finnish" 4
.IX Item "fi - Finnish"
.IP "fr \- French" 4
.IX Item "fr - French"
.IP "fy \- Frisian" 4
.IX Item "fy - Frisian"
.IP "ga \- Irish Gaelic" 4
.IX Item "ga - Irish Gaelic"
.IP "gd \- Scottish Gaelic" 4
.IX Item "gd - Scottish Gaelic"
.IP "he \- Hebrew" 4
.IX Item "he - Hebrew"
.IP "hi \- Hindi" 4
.IX Item "hi - Hindi"
.IP "hr \- Croatian" 4
.IX Item "hr - Croatian"
.IP "hu \- Hungarian" 4
.IX Item "hu - Hungarian"
.IP "hy \- Armenian" 4
.IX Item "hy - Armenian"
.IP "id \- Indonesian" 4
.IX Item "id - Indonesian"
.IP "is \- Icelandic" 4
.IX Item "is - Icelandic"
.IP "it \- Italian" 4
.IX Item "it - Italian"
.IP "ja \- Japanese" 4
.IX Item "ja - Japanese"
.IP "ka \- Georgian" 4
.IX Item "ka - Georgian"
.IP "ko \- Korean" 4
.IX Item "ko - Korean"
.IP "la \- Latin" 4
.IX Item "la - Latin"
.IP "lt \- Lithuanian" 4
.IX Item "lt - Lithuanian"
.IP "lv \- Latvian" 4
.IX Item "lv - Latvian"
.IP "mr \- Marathi" 4
.IX Item "mr - Marathi"
.IP "ms \- Malay" 4
.IX Item "ms - Malay"
.IP "ne \- Nepali" 4
.IX Item "ne - Nepali"
.IP "nl \- Dutch" 4
.IX Item "nl - Dutch"
.IP "no \- Norwegian" 4
.IX Item "no - Norwegian"
.IP "pl \- Polish" 4
.IX Item "pl - Polish"
.IP "pt \- Portuguese" 4
.IX Item "pt - Portuguese"
.IP "qu \- Quechua" 4
.IX Item "qu - Quechua"
.IP "rm \- Rhaeto-Romance" 4
.IX Item "rm - Rhaeto-Romance"
.IP "ro \- Romanian" 4
.IX Item "ro - Romanian"
.IP "ru \- Russian" 4
.IX Item "ru - Russian"
.IP "sa \- Sanskrit" 4
.IX Item "sa - Sanskrit"
.IP "sco \- Scots" 4
.IX Item "sco - Scots"
.IP "sk \- Slovak" 4
.IX Item "sk - Slovak"
.IP "sl \- Slovenian" 4
.IX Item "sl - Slovenian"
.IP "sq \- Albanian" 4
.IX Item "sq - Albanian"
.IP "sr \- Serbian" 4
.IX Item "sr - Serbian"
.IP "sv \- Swedish" 4
.IX Item "sv - Swedish"
.IP "sw \- Swahili" 4
.IX Item "sw - Swahili"
.IP "ta \- Tamil" 4
.IX Item "ta - Tamil"
.IP "th \- Thai" 4
.IX Item "th - Thai"
.IP "tl \- Tagalog" 4
.IX Item "tl - Tagalog"
.IP "tr \- Turkish" 4
.IX Item "tr - Turkish"
.IP "uk \- Ukrainian" 4
.IX Item "uk - Ukrainian"
.IP "vi \- Vietnamese" 4
.IX Item "vi - Vietnamese"
.IP "yi \- Yiddish" 4
.IX Item "yi - Yiddish"
.IP "zh \- Chinese (both Traditional and Simplified)" 4
.IX Item "zh - Chinese (both Traditional and Simplified)"
.IP "zh.big5 \- Chinese (Traditional only)" 4
.IX Item "zh.big5 - Chinese (Traditional only)"
.IP "zh.gb2312 \- Chinese (Simplified only)" 4
.IX Item "zh.gb2312 - Chinese (Simplified only)"
.RE
.RS 4
.PD
.Sp
.RE
.IP "inactive_languages xx [ yy zz ... ] (default: see below)" 4
.IX Item "inactive_languages xx [ yy zz ... ] (default: see below)"
This option is used to specify which languages will not be considered
when trying to guess the language. For performance reasons, supported
languages that have fewer than about 5 million speakers are disabled by
default. Note that listing a language in \f(CW\*(C`ok_languages\*(C'\fR automatically
enables it for that user.
.Sp
The default setting is:
.RS 4
.IP "bs cy eo et eu fy ga gd is la lt lv rm sa sco sl yi" 4
.IX Item "bs cy eo et eu fy ga gd is la lt lv rm sa sco sl yi"
.RE
.RS 4
.Sp
That list is Bosnian, Welsh, Esperanto, Estonian, Basque, Frisian, Irish
Gaelic, Scottish Gaelic, Icelandic, Latin, Lithuanian, Latvian,
Rhaeto-Romance, Sanskrit, Scots, Slovenian, and Yiddish.
.RE
.IP "textcat_max_languages N (default: 3)" 4
.IX Item "textcat_max_languages N (default: 3)"
The maximum number of languages any one message can simultaneously match
before its classification is considered unknown. You can try reducing this
to 2 or possibly even 1 for more confident results, as it's unusual for a
message to contain multiple languages.
.Sp
Read the description of textcat_acceptable_score as well, as these settings
are closely related. Scoring affects how many languages might be matched,
and this setting is the \*(L"false positive limit\*(R" beyond which the engine is
assumed to be unable to decide which languages the message really contains.
.IP "textcat_optimal_ngrams N (default: 0)" 4
.IX Item "textcat_optimal_ngrams N (default: 0)"
Ngrams whose number of occurrences is lower than this value are removed.
This can be used to speed up the program for longer inputs. For shorter
inputs, this should be set to 0.
.IP "textcat_max_ngrams N (default: 400)" 4
.IX Item "textcat_max_ngrams N (default: 400)"
The maximum number of ngrams that should be compared with each of the
language models (note that each of those models is used completely).
.IP "textcat_acceptable_score N (default: 1.02)" 4
.IX Item "textcat_acceptable_score N (default: 1.02)"
Include any language that scores at least \f(CW\*(C`textcat_acceptable_score\*(C'\fR in the
returned list of languages.
.Sp
This setting is basically a percentage range. Any language whose internal
ngram score is within N\-percent of the best score is included in the results
(the default of 1.02 corresponds to 2 percent). Values larger than 1.05 are
not recommended, as they can generate many false matches.
A setting of 1.00 would mean the single best-scoring language is always
forcibly selected; this is not recommended, because textcat_max_languages
can then no longer do its job of classifying the language as uncertain.
.Sp
Read the description of textcat_max_languages as well, as these settings
are closely related.
.Sp
You can use the _TEXTCATRESULTS_ tag to view the internal ngram scoring, which
may help when fine-tuning settings.
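.Sp
A possible starting point for tuning, shown only as a sketch: the values
below are illustrative rather than recommended, and the header name
\&\f(CW\*(C`TextCat\-Results\*(C'\fR is an arbitrary choice made for this example. It
tightens both related settings and uses \*(L"add_header\*(R" (see
Mail::SpamAssassin::Conf) to expose the _TEXTCATRESULTS_ tag in scanned mail:
.Sp
.Vb 3
\& textcat_acceptable_score 1.01
\& textcat_max_languages 2
\& add_header all TextCat\-Results _TEXTCATRESULTS_
.Ve
.Sp
Once the scores look reasonable for your mail, the extra header line can be
removed again.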