NAME¶
Regexp::List - builds regular expressions out of a list of words
SYNOPSIS¶
use Regexp::List;
my $l = Regexp::List->new;
my $re = $l->list2re(qw/foobar fooxar foozap fooza/);
# $re is now qr/foo(?:[bx]ar|zap?)/
ABSTRACT¶
This module offers "list2re" method that turns a list of words into an
optimized regular expression which matches all words therein.
The optimized regular expression is much more efficient than a simple-minded
'|'-concatenation thereof.
DESCRIPTION¶
This module use Object-Oriented approach so you can use this module as a base
and tweak its features. This module is a base class of Regexp::Optimizer.
EXPORT¶
Since this is an OO module there is no symbol exported.
METHODS¶
This module offers methods below;
- $l = Regexp::List->new(key=>value, ...)
- Constructor. When arguments are fed in key =>
value, manner, it sets attributes. See "$l->set" for
details
- $re = $l->list2re(list of words ...)
- Does the job. Takes a list of words and turn it into an
optimal regular expresson. See "IMPLEMENTATION" to find out how
it is achieved. If you want to know the underlying black magic even
further, see the source.
- $l->set(key => value, ...)
- Sets attributes. There are many attributes supported but
let me mention just a few that you may be interested.
- lookahead
- Whether you prepend a lookahead assertion or not. Default
value is 1. This module is smart enough to omit the assertion when you
don't need one.
$re = $l->list2re(qw/1 2 3 infinity/);
# qr/(?=[123i])(?:[123]|infinity)/
$re = $l->set(lookahead=>0)->list2re(qw/1 2 3 infinity/);
# qr/(?:[123]|infinity)/
- quotemeta
- Whether you quote metacharacters or not. Default is 1. If
you really need this feature try Regexp::Optimizer instead.
@list = qw/3 3.14 3.14159265358979/;
$re = $l->list2re(@list);
# qr/3(?:\.14(?:159265358979)?)?)/
$re = $l->set(lookahead=>0)->list2re(@list);
# qr/3(?:.14(?:159265358979)?)?)/
# which does match 3.14 but also "11+3=14"
- modifies
- Currently it accepts 'i', 'm', 's', and 'x', the same as
regular expression modifiers.
@list = qw/Perl perl BASIC basic/;
$re = $l->list2re(@list);
# qr/(?=[BPbp])(?:[Pp]erl|BASIC|basic)/
$re = $l->set(modifiers => 'i')->list2re(@list);
# qr/(?=[bp])(?:perl|basic)/i
print $l->set(modifiers => 'x')->list2re(@list);
# Try for yourself; Isn't itcute ?
- $l->expand($re);
- Utility methods to expand a regular expression. Handy when
you want to check the complex regexes.
- $l->unexpand($re);
- Utility methods to unexpand a regular expression.
IMPLEMENTATION¶
This module optimizes the regular expression as follows. Let's see what happens
when qw/foobar fooxar foozap fooza/ is fed
- trie building (common prefix aggregation)
- first the corresponding trie structure is built
+- bar
foo -+- xar
+- za -+- p
+- ''
- common suffix aggregation
- it checks if any leaf node can be optimized for common
suffix. In this case we can do so to "bar" and "xar".
+- b -+-ar
foo -+- x -+
+- za -+- p
+- ''
- character class conversion
- If a branch contains more than two single characters, it
turns it into a character class.
foo -+- [bx] --- ar
+- za -+-p
+- ''
- empty leaf to "?"
- Empty leaf is converted to a '?' quantifier
foo -+- [bx] --- ar
+- za -+-p?
- join all leafs into a group
- And the final result is reached.
foo(?:[bx]ar|zap?)
BENCHMARKS¶
This module is faily robust. You can practically use this module to find a
regular expression that matches all words in a dictionary. Here is a result by
on perl 5.8.0, FreeBSD 4-Stable, Pentium III 800 Mhz with 512 MB RAM.
# Sat May 31 09:11:06 2003 ( 0.000000 s) Reading /usr/share/dict/words
# Sat May 31 09:11:07 2003 ( 0.847797 s) 235881 lines read.
# Sat May 31 09:11:07 2003 ( 0.000000 s) Making regexp.
# Sat May 31 09:13:09 2003 ( 121.596928 s) Done.
# Sat May 31 09:13:09 2003 ( 0.000000 s) Saving to t/words.rx
# Sat May 31 09:13:09 2003 ( 0.000000 s) Reading t/words.rx
# Sat May 31 09:13:13 2003 ( 3.679176 s) Done.
# Sat May 31 09:13:13 2003 ( 0.000000 s) Opening /usr/share/dict/words for comparison.
# Sat May 31 09:13:13 2003 ( 0.255222 s) /usr/share/dict/words:235881 lines found.
# Sat May 31 09:13:13 2003 ( 0.000000 s) Showtime!
# 235881/235881
# Sat May 31 10:44:17 2003 ( 5464.370409 s) Done.
# Sat May 31 10:44:17 2003 ( 5464.370624 s) 43.167 matches/s
The result of optimization is obvious as the number of alteration increases.
Here is a result of a benchmark which matches randomly picked words against
"/usr/share/dict/words".
==== 2 words
Rate naive optim
naive 1.79/s -- -28%
optim 2.49/s 39% --
==== 256 words
s/iter naive optim
naive 31.7 -- -81%
optim 5.95 433% --
SEE ALSO¶
Regexp::Optimizer -- uses this module as its base
"eg/" directory in this package contains example scripts.
- Perl standard documents
- perltodo, perlre
- CPAN Modules
- Regexp::Presuf, Text::Trie
- Books
- Mastering Regular Expressions
<http://www.oreilly.com/catalog/regex2/>
AUTHOR¶
Dan Kogai <dankogai@dan.co.jp>
COPYRIGHT AND LICENSE¶
Copyright 2003 by Dan Kogai, All Rights Reserved.
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.