.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.40) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is >0, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{\ . if \nF \{\ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{\ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . 
ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "Algorithm::NaiveBayes 3pm" .TH Algorithm::NaiveBayes 3pm "2021-01-07" "perl v5.32.0" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" Algorithm::NaiveBayes \- Bayesian prediction of categories .SH "SYNOPSIS" .IX Header "SYNOPSIS" .Vb 2 \& use Algorithm::NaiveBayes; \& my $nb = Algorithm::NaiveBayes\->new; \& \& $nb\->add_instance \& (attributes => {foo => 1, bar => 1, baz => 3}, \& label => \*(Aqsports\*(Aq); \& \& $nb\->add_instance \& (attributes => {foo => 2, blurp => 1}, \& label => [\*(Aqsports\*(Aq, \*(Aqfinance\*(Aq]); \& \& ... repeat for several more instances, then: \& $nb\->train; \& \& # Find results for unseen instances \& my $result = $nb\->predict \& (attributes => {bar => 3, blurp => 2}); .Ve .SH "DESCRIPTION" .IX Header "DESCRIPTION" This module implements the classic \*(L"Naive Bayes\*(R" machine learning algorithm. It is a well-studied probabilistic algorithm often used in automatic text categorization. Compared to other algorithms (kNN, \&\s-1SVM,\s0 Decision Trees), it's pretty fast and reasonably competitive in the quality of its results. .PP A paper by Fabrizio Sebastiani provides a really good introduction to text categorization: .SH "METHODS" .IX Header "METHODS" .IP "\fBnew()\fR" 4 .IX Item "new()" Creates a new \f(CW\*(C`Algorithm::NaiveBayes\*(C'\fR object and returns it. The following parameters are accepted: .RS 4 .IP "purge" 4 .IX Item "purge" If set to a true value, the \f(CW\*(C`do_purge()\*(C'\fR method will be invoked during \&\f(CW\*(C`train()\*(C'\fR. The default is true. Set this to a false value if you'd like to be able to add additional instances after training and then call \f(CW\*(C`train()\*(C'\fR again. .RE .RS 4 .RE .IP "add_instance( attributes => \s-1HASH,\s0 label => STRING|ARRAY )" 4 .IX Item "add_instance( attributes => HASH, label => STRING|ARRAY )" Adds a training instance to the categorizer. The \f(CW\*(C`attributes\*(C'\fR parameter contains a hash reference whose keys are string attributes and whose values are the weights of those attributes. For instance, if you're categorizing text documents, the attributes might be the words of the document, and the weights might be the number of times each word occurs in the document. .Sp The \f(CW\*(C`label\*(C'\fR parameter can contain a single string or an array of strings, with each string representing a label for this instance. The labels can be any arbitrary strings. To indicate that a document has no applicable labels, pass an empty array reference. .IP "\fBtrain()\fR" 4 .IX Item "train()" Calculates the probabilities that will be necessary for categorization using the \f(CW\*(C`predict()\*(C'\fR method. .IP "predict( attributes => \s-1HASH\s0 )" 4 .IX Item "predict( attributes => HASH )" Use this method to predict the label of an unknown instance. The attributes should be of the same format as you passed to \&\f(CW\*(C`add_instance()\*(C'\fR. \f(CW\*(C`predict()\*(C'\fR returns a hash reference whose keys are the names of labels, and whose values are the score for each label. Scores are between 0 and 1, where 0 means the label doesn't seem to apply to this instance, and 1 means it does. .Sp In practice, scores using Naive Bayes tend to be very close to 0 or 1 because of the way normalization is performed. 
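.Sp For example, here is one way to rank the returned labels by score and pick the most likely one. This is only a minimal sketch; it assumes the \f(CW$result\fR hash reference from the \s-1SYNOPSIS\s0 above: .Sp .Vb 3 \& # Sort the labels in the $result hash (from the predict() call above) \& my @ranked = sort { $result\->{$b} <=> $result\->{$a} } keys %$result; \& my $best_label = $ranked[0]; # the label with the highest score .Ve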
.Sp I might try to alleviate this clustering of scores in future versions of the code. .IP "\fBlabels()\fR" 4 .IX Item "labels()" Returns a list of all the labels the object knows about (in no particular order), or the number of labels if called in a scalar context. .IP "\fBdo_purge()\fR" 4 .IX Item "do_purge()" Purges training instances and their associated information from the NaiveBayes object. This can save memory after training. .IP "\fBpurge()\fR" 4 .IX Item "purge()" Returns true or false depending on the value of the object's \f(CW\*(C`purge\*(C'\fR property. An optional boolean argument sets the property. .IP "save_state($path)" 4 .IX Item "save_state($path)" This object method saves the object to disk for later use. The \&\f(CW$path\fR argument indicates the place on disk where the object should be saved: .Sp .Vb 1 \& $nb\->save_state($path); .Ve .IP "restore_state($path)" 4 .IX Item "restore_state($path)" This class method reads the file specified by \f(CW$path\fR and returns the object that was previously stored there using \f(CW\*(C`save_state()\*(C'\fR: .Sp .Vb 1 \& $nb = Algorithm::NaiveBayes\->restore_state($path); .Ve .SH "THEORY" .IX Header "THEORY" Bayes' Theorem is a way of inverting a conditional probability. It states: .PP .Vb 3 \& P(y|x) P(x) \& P(x|y) = \-\-\-\-\-\-\-\-\-\-\-\-\- \& P(y) .Ve .PP The notation \f(CW\*(C`P(x|y)\*(C'\fR means "the probability of \f(CW\*(C`x\*(C'\fR given \f(CW\*(C`y\*(C'\fR." See also <http://mathforum.org/dr.math/problems/battisfore.03.22.99.html> for a simple but complete example of Bayes' Theorem. .PP In this case, we want to know the probability of a given category given a certain string of words in a document, so we have: .PP .Vb 3 \& P(words | cat) P(cat) \& P(cat | words) = \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- \& P(words) .Ve .PP We have applied Bayes' Theorem because \f(CW\*(C`P(cat | words)\*(C'\fR is a difficult quantity to compute directly, but \f(CW\*(C`P(words | cat)\*(C'\fR and \f(CW\*(C`P(cat)\*(C'\fR are accessible (see below). .PP The greater the expression above, the greater the probability that the given document belongs to the given category. So we want to find the maximum value. We write this as .PP .Vb 3 \& P(words | cat) P(cat) \& Best category = ArgMax \-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\- \& cat in cats P(words) .Ve .PP Since \f(CW\*(C`P(words)\*(C'\fR doesn't change over the range of categories, we can get rid of it. That's good, because we didn't want to have to compute these values anyway. So our new formula is: .PP .Vb 2 \& Best category = ArgMax P(words | cat) P(cat) \& cat in cats .Ve .PP Finally, we note that if \f(CW\*(C`w1, w2, ... wn\*(C'\fR are the words in the document, then this expression is equivalent to: .PP .Vb 2 \& Best category = ArgMax P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat) \& cat in cats .Ve .PP That's the formula I use in my document categorization code. The last step is the only non-rigorous one in the derivation, and this is the \&\*(L"naive\*(R" part of the Naive Bayes technique. It assumes that the probability of each word appearing in a document is unaffected by the presence or absence of each other word in the document. We assume this even though we know this isn't true: for example, the word \&\*(L"iodized\*(R" is far more likely to appear in a document that contains the word \*(L"salt\*(R" than it is to appear in a document that contains the word \&\*(L"subroutine\*(R". Luckily, as it turns out, making this assumption even when it isn't true may have little effect on our results, as the following paper by Pedro Domingos argues: <http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz>.
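.PP To make the final ArgMax formula above concrete, here is a minimal illustrative sketch of that computation in Perl. It is not how this module is implemented internally; the \f(CW%prior\fR hash (estimates of \f(CW\*(C`P(cat)\*(C'\fR), the \f(CW%word_prob\fR hash (estimates of \f(CW\*(C`P(word|cat)\*(C'\fR), and the \f(CW@words\fR list are hypothetical, and a real implementation would typically work with sums of logarithms and some form of smoothing to avoid numeric underflow and zero probabilities for unseen words: .PP .Vb 9 \& # Score each category as P(w1|cat)*...*P(wn|cat)*P(cat) \& # (%prior and %word_prob are hypothetical, estimated from training counts) \& my %score; \& for my $cat (keys %prior) { \&     my $s = $prior{$cat}; \&     $s *= $word_prob{$cat}{$_} for @words; \&     $score{$cat} = $s; \& } \& my ($best) = sort { $score{$b} <=> $score{$a} } keys %score; .Ve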
.SH "HISTORY" .IX Header "HISTORY" My first implementation of a Naive Bayes algorithm was in the now-obsolete AI::Categorize module, first released in May 2001. I replaced it with the Naive Bayes implementation in AI::Categorizer (note the extra 'r'), first released in July 2002. I then extracted that implementation into its own module that could be used outside the framework, and that's what you see here. .SH "AUTHOR" .IX Header "AUTHOR" Ken Williams, ken@mathforum.org .SH "COPYRIGHT" .IX Header "COPYRIGHT" Copyright 2003\-2004 Ken Williams. All rights reserved. .PP This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself. .SH "SEE ALSO" .IX Header "SEE ALSO" \&\fBAI::Categorizer\fR\|(3), perl.