.\" Automatically generated by Pod::Man 2.25 (Pod::Simple 3.16) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is turned on, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .ie \nF \{\ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . nr % 0 . rr F .\} .el \{\ . de IX .. .\} .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "Regexp::List 3pm" .TH Regexp::List 3pm "2011-11-16" "perl v5.14.2" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" Regexp::List \- builds regular expressions out of a list of words .SH "SYNOPSIS" .IX Header "SYNOPSIS" .Vb 4 \& use Regexp::List; \& my $l = Regexp::List\->new; \& my $re = $l\->list2re(qw/foobar fooxar foozap fooza/); \& # $re is now qr/foo(?:[bx]ar|zap?)/ .Ve .SH "ABSTRACT" .IX Header "ABSTRACT" This module offers \f(CW\*(C`list2re\*(C'\fR method that turns a list of words into an optimized regular expression which matches all words therein. .PP The optimized regular expression is much more efficient than a simple-minded '|'\-concatenation thereof. .SH "DESCRIPTION" .IX Header "DESCRIPTION" This module use Object-Oriented approach so you can use this module as a base and tweak its features. This module is a base class of Regexp::Optimizer. .SS "\s-1EXPORT\s0" .IX Subsection "EXPORT" Since this is an \s-1OO\s0 module there is no symbol exported. .SH "METHODS" .IX Header "METHODS" This module offers methods below; .ie n .IP "$l = Regexp::List\->new(\fIkey=>value, ...\fR)" 4 .el .IP "\f(CW$l\fR = Regexp::List\->new(\fIkey=>value, ...\fR)" 4 .IX Item "$l = Regexp::List->new(key=>value, ...)" Constructor. When arguments are fed in \fIkey => value,\fR manner, it sets attributes. See \f(CW\*(C`$l\->set\*(C'\fR for details .ie n .IP "$re = $l\->list2re(\fIlist of words ...\fR)" 4 .el .IP "\f(CW$re\fR = \f(CW$l\fR\->list2re(\fIlist of words ...\fR)" 4 .IX Item "$re = $l->list2re(list of words ...)" Does the job. Takes a list of words and turn it into an optimal regular expresson. See \*(L"\s-1IMPLEMENTATION\s0\*(R" to find out how it is achieved. If you want to know the underlying black magic even further, see the source. .ie n .IP "$l\->set(\fIkey => value, ...\fR)" 4 .el .IP "\f(CW$l\fR\->set(\fIkey => value, ...\fR)" 4 .IX Item "$l->set(key => value, ...)" Sets attributes. There are many attributes supported but let me mention just a few that you may be interested. .RS 4 .IP "lookahead" 4 .IX Item "lookahead" Whether you prepend a lookahead assertion or not. Default value is 1. This module is smart enough to omit the assertion when you don't need one. .Sp .Vb 4 \& $re = $l\->list2re(qw/1 2 3 infinity/); \& # qr/(?=[123i])(?:[123]|infinity)/ \& $re = $l\->set(lookahead=>0)\->list2re(qw/1 2 3 infinity/); \& # qr/(?:[123]|infinity)/ .Ve .IP "quotemeta" 4 .IX Item "quotemeta" Whether you quote metacharacters or not. Default is 1. If you really need this feature try Regexp::Optimizer instead. .Sp .Vb 6 \& @list = qw/3 3.14 3.14159265358979/; \& $re = $l\->list2re(@list); \& # qr/3(?:\e.14(?:159265358979)?)?)/ \& $re = $l\->set(lookahead=>0)\->list2re(@list); \& # qr/3(?:.14(?:159265358979)?)?)/ \& # which does match 3.14 but also "11+3=14" .Ve .IP "modifies" 4 .IX Item "modifies" Currently it accepts 'i', 'm', 's', and 'x', the same as regular expression modifiers. .Sp .Vb 7 \& @list = qw/Perl perl BASIC basic/; \& $re = $l\->list2re(@list); \& # qr/(?=[BPbp])(?:[Pp]erl|BASIC|basic)/ \& $re = $l\->set(modifiers => \*(Aqi\*(Aq)\->list2re(@list); \& # qr/(?=[bp])(?:perl|basic)/i \& print $l\->set(modifiers => \*(Aqx\*(Aq)\->list2re(@list); \& # Try for yourself; Isn\*(Aqt itcute ? .Ve .RE .RS 4 .RE .ie n .IP "$l\->expand($re);" 4 .el .IP "\f(CW$l\fR\->expand($re);" 4 .IX Item "$l->expand($re);" Utility methods to expand a regular expression. Handy when you want to check the complex regexes. .ie n .IP "$l\->unexpand($re);" 4 .el .IP "\f(CW$l\fR\->unexpand($re);" 4 .IX Item "$l->unexpand($re);" Utility methods to unexpand a regular expression. .SH "IMPLEMENTATION" .IX Header "IMPLEMENTATION" This module optimizes the regular expression as follows. Let's see what happens when qw/foobar fooxar foozap fooza/ is fed .IP "trie building (common prefix aggregation)" 4 .IX Item "trie building (common prefix aggregation)" first the corresponding \fItrie\fR structure is built .Sp .Vb 4 \& +\- bar \& foo \-+\- xar \& +\- za \-+\- p \& +\- \*(Aq\*(Aq .Ve .IP "common suffix aggregation" 4 .IX Item "common suffix aggregation" it checks if any leaf node can be optimized for common suffix. In this case we can do so to \*(L"bar\*(R" and \*(L"xar\*(R". .Sp .Vb 4 \& +\- b \-+\-ar \& foo \-+\- x \-+ \& +\- za \-+\- p \& +\- \*(Aq\*(Aq .Ve .IP "character class conversion" 4 .IX Item "character class conversion" If a branch contains more than two single characters, it turns it into a character class. .Sp .Vb 3 \& foo \-+\- [bx] \-\-\- ar \& +\- za \-+\-p \& +\- \*(Aq\*(Aq .Ve .ie n .IP "empty leaf to ""?""" 4 .el .IP "empty leaf to \f(CW?\fR" 4 .IX Item "empty leaf to ?" Empty leaf is converted to a '?' quantifier .Sp .Vb 2 \& foo \-+\- [bx] \-\-\- ar \& +\- za \-+\-p? .Ve .IP "join all leafs into a group" 4 .IX Item "join all leafs into a group" And the final result is reached. .Sp .Vb 1 \& foo(?:[bx]ar|zap?) .Ve .SH "BENCHMARKS" .IX Header "BENCHMARKS" This module is faily robust. You can practically use this module to find a regular expression that matches all words in a dictionary. Here is a result by on perl 5.8.0, FreeBSD 4\-Stable, Pentium \s-1III\s0 800 Mhz with 512 \s-1MB\s0 \s-1RAM\s0. .PP .Vb 10 \& # Sat May 31 09:11:06 2003 ( 0.000000 s) Reading /usr/share/dict/words \& # Sat May 31 09:11:07 2003 ( 0.847797 s) 235881 lines read. \& # Sat May 31 09:11:07 2003 ( 0.000000 s) Making regexp. \& # Sat May 31 09:13:09 2003 ( 121.596928 s) Done. \& # Sat May 31 09:13:09 2003 ( 0.000000 s) Saving to t/words.rx \& # Sat May 31 09:13:09 2003 ( 0.000000 s) Reading t/words.rx \& # Sat May 31 09:13:13 2003 ( 3.679176 s) Done. \& # Sat May 31 09:13:13 2003 ( 0.000000 s) Opening /usr/share/dict/words for comparison. \& # Sat May 31 09:13:13 2003 ( 0.255222 s) /usr/share/dict/words:235881 lines found. \& # Sat May 31 09:13:13 2003 ( 0.000000 s) Showtime! \& # 235881/235881 \& # Sat May 31 10:44:17 2003 ( 5464.370409 s) Done. \& # Sat May 31 10:44:17 2003 ( 5464.370624 s) 43.167 matches/s .Ve .PP The result of optimization is obvious as the number of alteration increases. Here is a result of a benchmark which matches randomly picked words against \f(CW\*(C`/usr/share/dict/words\*(C'\fR. .PP .Vb 4 \& ==== 2 words \& Rate naive optim \& naive 1.79/s \-\- \-28% \& optim 2.49/s 39% \-\- \& \& ==== 256 words \& s/iter naive optim \& naive 31.7 \-\- \-81% \& optim 5.95 433% \-\- .Ve .SH "SEE ALSO" .IX Header "SEE ALSO" Regexp::Optimizer \*(-- uses this module as its base .PP \&\f(CW\*(C`eg/\*(C'\fR directory in this package contains example scripts. .IP "Perl standard documents" 4 .IX Item "Perl standard documents" perltodo, perlre .IP "\s-1CPAN\s0 Modules" 4 .IX Item "CPAN Modules" Regexp::Presuf, Text::Trie .IP "Books" 4 .IX Item "Books" Mastering Regular Expressions .SH "AUTHOR" .IX Header "AUTHOR" Dan Kogai .SH "COPYRIGHT AND LICENSE" .IX Header "COPYRIGHT AND LICENSE" Copyright 2003 by Dan Kogai, All Rights Reserved. .PP This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.