NAME¶
Regexp::Optimizer - optimizes regular expressions
SYNOPSIS¶
use Regexp::Optimizer;
my $o = Regexp::Optimizer->new;
my $re = $o->optimize(qr/foobar|fooxar|foozap/);
# $re is now qr/foo(?:[bx]ar|zap)/
ABSTRACT¶
This module does, ahem, attempts to, optimize regular expressions.
INSTALLATION¶
To install this module type the following:
perl Makefile.PL
make
make test
make install
DESCRIPTION¶
Here is a quote from perltodo.
Factoring out common suffices/prefices in
regexps (trie optimization)
Currently, the user has to optimize "foo|far" and "foo|goo"
into "f(?:oo|ar)" and "[fg]oo" by hand; this could be done
automatically.
This module implements just that.
EXPORT¶
Since this is an OO module there is no symbol exported.
METHODS¶
This module is implemented as a subclass of Regexp::List. For methods not listed
here, see Regexp::List.
- $o = Regexp::Optimizer->new;
- $o->set(key => value, ...)
- Just the same us Regexp::List except for the attribute
below;
- unexpand
- When set to one, $o->optimize() tries to
$o->expand before actually starting the operation.
# cases you need to set expand => 1
$o->set(expand => 1)->optimize(qr/
foobar|
fooxar|
foozar
/x);
- $re = $o->optimize(regexp);
- Does the job. Note that unlike "->list2re()"
in Regexp::List, the argument is the regular expression itself. What it
basically does is to find groups will alterations and replace it with the
result of "$o->list2re".
- $re = $o->list2re(list of words ...)
- Same as "list2re()" in Regexp::List in terms of
functionality but how it tokenize "atoms" is different since the
arguments can be regular expressions, not just strings. Here is a brief
example.
my @expr = qw/foobar fooba+/;
Regexp::List->new->list2re(@expr) eq qr/fooba[\+r]/;
Regexp::Optimizer->new->list2re(@expr) eq qr/foob(?:a+ar)/;
CAVEATS¶
This module is still experimental. Do not assume that the result is the same as
the unoptimized version.
- •
- When you just want a regular expression which matches
normal words with not metacharacters, use <Regexp::List>. It's more
robus and much faster.
- •
- When you have a list of regular expessions which you want
to aggregate, use "list2re" of THIS MODULE.
- •
- Use "->optimize()" when and only when you
already have a big regular expression with alterations therein.
"->optimize()" does support nested groups but its parser is not
tested very well.
BUGS¶
- •
- Regex parser in this module (which itself is implemented by
regular expression) is not as thoroughly tested as Regexp::List
- •
- May still fall into deep recursion when you attempt to
optimize deeply nested regexp. See "PRACTICALITY".
- •
- Does not grok (?{expression}) and (?(cond)yes|no)
constructs yet
- •
- You need to escape characters in character classes.
$o->optimize(qr/[a-z()]|[A-Z]/); # wrong
$o->optimize(qr/[a-z\(\)]|[A-Z]/); # right
$o->optimize(qr/[0-9A-Za-z]|[\Q-_.!~*"'()\E]/ # right, too.
- •
- When character(?: class(?:es)?)? are aggregated, duplicate
ranges are left as is. Though functionally OK, it is cosmetically ugly.
$o->optimize(qr/[0-5]|[5-9]|0123456789/);
# simply turns into [0-5][5-9]0123456789] not [0-9]
I left it that way because marking-rearranging approach can result a
humongous result when unicode characters are concerned (and
\p{Properties}).
PRACTICALITY¶
Though this module is still experimental, It is still good enough even for such
deeply nested regexes as the followng.
# See 3.2.2 of http://www.ietf.org/rfc/rfc2616.txt
# BNF faithfully turned into a regex
http://(?:(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|(?:(?:[a-z]|[A-Z])|[0-9])(?:(?:(?:[a-z]|[A-Z])|[0-9])|-)*(?:(?:[a-z]|[A-Z])|[0-9]))\.)*(?:(?:[a-z]|[A-Z])|(?:[a-z]|[A-Z])(?:(?:(?:[a-z]|[A-Z])|[0-9])|-)*(?:(?:[a-z]|[A-Z])|[0-9]))\.?|[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)(?::[0-9]*)?(?:/(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|[\-\_\.\!\~\*\'\(\)])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])|[:@&=+$,])*(?:;(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|[\-\_\.\!\~\*\'\(\)])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])|[:@&=+$,])*)*(?:/(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|[\-\_\.\!\~\*\'\(\)])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])|[:@&=+$,])*(?:;(?:(?:(?:(?:[a-z]|[A-Z])|[0-9])|[\-\_\.\!\~\*\'\(\)])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f])|[:@&=+$,])*)*)*(?:\\?(?:[;/?:@&=+$,]|(?:(?:(?:[a-z]|[A-Z])|[0-9])|[\-\_\.\!\~\*\'\(\)])|%(?:[0-9]|[A-Fa-f])(?:[0-9]|[A-Fa-f]))*)?)?
# and optimized
http://(?::?[a-zA-Z0-9](?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?\.[a-zA-Z]*(?:[a-zA-Z0-9\-]*[a-zA-Z0-9])?\.?|[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+)(?::[0-9]*)?(?:/(?:(?:(?:[a-zA-Z0-9\-\_\.\!\~\*\'\x28\x29]|%[0-9A-Fa-f][0-9A-Fa-f])|[:@&=+$,]))*(?:;(?:(?:(?:[a-zA-Z0-9\-\_\.\!\~\*\'\x28\x29]|%[0-9A-Fa-f][0-9A-Fa-f])|[:@&=+$,]))*)*(?:/(?:(?:(?:[a-zA-Z0-9\-\_\.\!\~\*\'\x28\x29]|%[0-9A-Fa-f][0-9A-Fa-f])|[:@&=+$,]))*(?:;(?:(?:(?:[a-zA-Z0-9\-\_\.\!\~\*\'\x28\x29]|%[0-9A-Fa-f][0-9A-Fa-f])|[:@&=+$,]))*)*)*(?:\\?(?:(?:[;/?:@&=+$,a-zA-Z0-9\-\_\.\!\~\*\'\x28\x29]|%[0-9A-Fa-f][0-9A-Fa-f]))*)?)?
By carefully examine both you can find that character classes are properly
aggregated.
SEE ALSO¶
Regexp::List -- upon which this module is based
"eg/" directory in this package contains example scripts.
- Perl standard documents
-
L<perltodo>, L<perlre>
- CPAN Modules
- Regexp::Presuf, Text::Trie
- Books
- Mastering Regular Expressions
<http://www.oreilly.com/catalog/regex2/>
AUTHOR¶
Dan Kogai <dankogai@dan.co.jp>
COPYRIGHT AND LICENSE¶
Copyright 2003 by Dan Kogai
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.