.\" Automatically generated by Pod::Man 2.28 (Pod::Simple 3.28) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is turned on, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{ . if \nF \{ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . 
. ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "Lucy::Docs::Tutorial::Analysis 3pm"
.TH Lucy::Docs::Tutorial::Analysis 3pm "2015-03-06" "perl v5.20.2" "User Contributed Perl Documentation"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
Lucy::Docs::Tutorial::Analysis \- How to choose and use Analyzers.
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
Try swapping out the PolyAnalyzer in our Schema for a RegexTokenizer:
.PP
.Vb 4
\&    my $tokenizer = Lucy::Analysis::RegexTokenizer\->new;
\&    my $type = Lucy::Plan::FullTextType\->new(
\&        analyzer => $tokenizer,
\&    );
.Ve
.PP
Search for \f(CW\*(C`senate\*(C'\fR, \f(CW\*(C`Senate\*(C'\fR, and \f(CW\*(C`Senator\*(C'\fR before and after
making the change and re-indexing.
.PP
Under PolyAnalyzer, the results are identical for all three searches, but
under RegexTokenizer, searches are case-sensitive, and the result sets for
\&\f(CW\*(C`Senate\*(C'\fR and \f(CW\*(C`Senator\*(C'\fR are distinct.
.SS "PolyAnalyzer"
.IX Subsection "PolyAnalyzer"
What's happening is that PolyAnalyzer is performing more aggressive processing
than RegexTokenizer. In addition to tokenizing, it's also converting all text
to lower case so that searches are case-insensitive, and using a \*(L"stemming\*(R"
algorithm to reduce related words to a common stem (\f(CW\*(C`senat\*(C'\fR, in this case).
.PP
PolyAnalyzer is actually multiple Analyzers wrapped up in a single package.
In this case, it's three-in-one, since specifying a PolyAnalyzer with
\&\f(CW\*(C`language => \*(Aqen\*(Aq\*(C'\fR is equivalent to this snippet:
.PP
.Vb 6
\&    my $case_folder  = Lucy::Analysis::CaseFolder\->new;
\&    my $tokenizer    = Lucy::Analysis::RegexTokenizer\->new;
\&    my $stemmer      = Lucy::Analysis::SnowballStemmer\->new( language => \*(Aqen\*(Aq );
\&    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer\->new(
\&        analyzers => [ $case_folder, $tokenizer, $stemmer ],
\&    );
.Ve
.PP
You can add or subtract Analyzers from there if you like. Try adding a fourth
Analyzer, a SnowballStopFilter for suppressing \*(L"stopwords\*(R" like \*(L"the\*(R", \*(L"if\*(R",
and \*(L"maybe\*(R".
.PP
.Vb 6
\&    my $stopfilter = Lucy::Analysis::SnowballStopFilter\->new(
\&        language => \*(Aqen\*(Aq,
\&    );
\&    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer\->new(
\&        analyzers => [ $case_folder, $tokenizer, $stopfilter, $stemmer ],
\&    );
.Ve
.PP
Also, try removing the SnowballStemmer.
.PP
.Vb 3
\&    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer\->new(
\&        analyzers => [ $case_folder, $tokenizer ],
\&    );
.Ve
.PP
The original choice of a stock English PolyAnalyzer probably still yields the
best results for this document collection, but you get the idea: sometimes you
want a different Analyzer.
.SS "When the best Analyzer is no Analyzer"
.IX Subsection "When the best Analyzer is no Analyzer"
Sometimes you don't want an Analyzer at all. That was true for our \*(L"url\*(R"
field because we didn't need it to be searchable, but it's also true for
certain types of searchable fields. For instance, \*(L"category\*(R" fields are
often set up to match exactly or not at all, as are fields like \*(L"last_name\*(R"
(because you may not want to conflate results for \*(L"Humphrey\*(R" and
\*(L"Humphries\*(R").
.PP
To specify that there should be no analysis performed at all, use StringType:
.PP
.Vb 2
\&    my $type = Lucy::Plan::StringType\->new;
\&    $schema\->spec_field( name => \*(Aqcategory\*(Aq, type => $type );
.Ve
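.PP
As a minimal sketch (not part of the original tutorial), here is one way such
an exact-match field might be queried from application code. Because no
Analyzer touches a StringType field, the query term has to match the stored
value exactly; the index path, the \*(L"category\*(R" field, and the
\&\*(L"sports\*(R" term below are illustrative assumptions.
.PP
.Vb 8
\&    # Sketch only: query a StringType field with an exact-match TermQuery.
\&    # The index path, field name, and term are assumed for illustration.
\&    my $searcher = Lucy::Search::IndexSearcher\->new( index => \*(Aqpath/to/index\*(Aq );
\&    my $query    = Lucy::Search::TermQuery\->new(
\&        field => \*(Aqcategory\*(Aq,
\&        term  => \*(Aqsports\*(Aq,    # matched verbatim, since the field is unanalyzed
\&    );
\&    my $hits = $searcher\->hits( query => $query );
.Ve
.PP
A TermQuery bypasses the query parser entirely, so nothing lowercases, stems,
or otherwise rewrites the term before it is looked up in the index.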
.SS "Highlighting up next"
.IX Subsection "Highlighting up next"
In our next tutorial chapter, Lucy::Docs::Tutorial::Highlighter, we'll add
highlighted excerpts from the \*(L"content\*(R" field to our search results.