NAME
Perl::Critic::Policy::RegularExpressions::ProhibitComplexRegexes - Split long
regexps into smaller "qr//" chunks.
AFFILIATION
This Policy is part of the core Perl::Critic distribution.
DESCRIPTION
Big regexps are hard to read, perhaps even the hardest part of Perl. A good
practice is to write digestible chunks of regexp and put them together. This
policy flags any regexp that is longer than "N" characters, where
"N" is a configurable value that defaults to 60. If the regexp uses
the "x" flag, then the length is computed after stripping out any
comments and whitespace.
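As a minimal sketch of the "digestible chunks" idea, the example below builds a timestamp matcher from small "qr//" pieces. The variable names and the timestamp format are assumptions chosen for illustration only; they do not come from Perl::Critic.

```perl
use strict;
use warnings;

# Small named pieces; each one is easy to read and test by itself.
# (All names here are hypothetical, picked for this example.)
my $date   = qr/\d{4}-\d{2}-\d{2}/;
my $time   = qr/\d{2}:\d{2}:\d{2}(?:\.\d+)?/;
my $offset = qr/Z|[+-]\d{2}:\d{2}/;

# The assembled regexp is short and reads almost like a grammar rule.
# Each interpolated qr// keeps its own grouping, so the alternation
# inside $offset does not leak into the surrounding pattern.
my $timestamp = qr/\A $date T $time $offset \z/x;

for my $candidate ('2011-03-04T05:06:07Z', '2011-03-04 05:06') {
    my $verdict = $candidate =~ $timestamp ? 'matches' : 'does not match';
    print "$candidate $verdict\n";
}
```

Written as a single literal, the same pattern would be considerably longer and harder to compare against the format it is meant to describe.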
Unfortunately the use of descriptive (and therefore longish) variable names can
cause regexps to be in violation of this policy, so interpolated variables are
counted as 4 characters no matter how long their names actually are.
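The counting rule described above can be approximated with a short sketch. The helper name approx_regex_length and its substitutions are assumptions made for this example; this is not the policy's actual implementation, and the simplification mishandles cases such as whitespace or "#" inside a character class.

```perl
use strict;
use warnings;

# Hypothetical helper: a rough approximation of the counting rule.
# It is NOT the code Perl::Critic uses internally.
sub approx_regex_length {
    my ($pattern, $has_x_flag) = @_;

    if ($has_x_flag) {
        # Drop "# ..." comments that run to the end of the line,
        # unless the "#" is escaped with a backslash.
        $pattern =~ s/(?<!\\)#[^\n]*//g;
        # Drop whitespace that is not escaped with a backslash.
        $pattern =~ s/(?<!\\)\s+//g;
    }

    # Count every interpolated scalar as 4 characters, whatever its name.
    $pattern =~ s/(?<!\\)\$[A-Za-z_]\w*/xxxx/g;

    return length $pattern;
}

# Single-quoted strings keep "$name" literal, so nothing is interpolated here.
my $long_name  = approx_regex_length('\A $hostname_label \z', 1);
my $short_name = approx_regex_length('\A $h \z', 1);
print "$long_name $short_name\n";    # prints "8 8"
```

Under this approximation a long, descriptive variable name costs no more than a one-letter name, which is the point of the 4-character rule.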
CASE STUDY
As an example, look at the regexp used to match email addresses in
Email::Valid::Loose (tweaked lightly to wrap for POD):
(?x-ism:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]
\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015
"]*)*")(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[
\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n
\015"]*)*")|\.)*\@(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,
;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)(?:\.(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000
-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]))*)
which is constructed from the following code:
my $esc = '\\\\';
my $period = '\.';
my $space = '\040';
my $open_br = '\[';
my $close_br = '\]';
my $nonASCII = '\x80-\xff';
my $ctrl = '\000-\037';
my $cr_list = '\n\015';
my $qtext = qq/[^$esc$nonASCII$cr_list\"]/; # "
my $dtext = qq/[^$esc$nonASCII$cr_list$open_br$close_br]/;
my $quoted_pair = qq<$esc>.qq<[^$nonASCII]>;
my $atom_char = qq/[^($space)<>\@,;:\".$esc$open_br$close_br$ctrl$nonASCII]/;# "
my $atom = qq<$atom_char+(?!$atom_char)>;
my $quoted_str = qq<\"$qtext*(?:$quoted_pair$qtext*)*\">; # "
my $word = qq<(?:$atom|$quoted_str)>;
my $domain_ref = $atom;
my $domain_lit = qq<$open_br(?:$dtext|$quoted_pair)*$close_br>;
my $sub_domain = qq<(?:$domain_ref|$domain_lit)>;
my $domain = qq<$sub_domain(?:$period$sub_domain)*>;
my $local_part = qq<$word(?:$word|$period)*>; # This part is modified
$Addr_spec_re = qr<$local_part\@$domain>;
Read from bottom to top, the code is quite readable. You can even see the one
violation of RFC 822 that Tatsuhiko Miyagawa deliberately put into
Email::Valid::Loose to allow periods in the local part: it is the "$local_part"
line marked "This part is modified". Look for the "|\." in the
upper regexp to see that same deviation.
One could certainly argue that the top regexp could be re-written more legibly
with "m//x" and comments. But the bottom version is self-documenting
and, for example, doesn't repeat "\x80-\xff" 18 times. Furthermore,
it's much easier to compare the second version against the source BNF grammar
in RFC 822 to judge whether the implementation is sound even before running
tests.
CONFIGURATION
This policy allows regexps up to "N" characters long, where
"N" defaults to 60. You can override this default with the
"max_characters" setting. To do so, put an entry in a
.perlcriticrc file like this:
[RegularExpressions::ProhibitComplexRegexes]
max_characters = 40
CREDITS
Initial development of this policy was supported by a grant from the Perl
Foundation.
AUTHOR
Chris Dolan <cdolan@cpan.org>
COPYRIGHT
Copyright (c) 2007-2011 Chris Dolan. Many rights reserved.
This program is free software; you can redistribute it and/or modify it under
the same terms as Perl itself. The full text of this license can be found in
the LICENSE file included with this module.