NAME¶
Regexp::Common - Provide commonly requested regular expressions
SYNOPSIS¶
# STANDARD USAGE
use Regexp::Common;
while (<>) {
/$RE{num}{real}/ and print q{a number};
/$RE{quoted}/ and print q{a ['"`] quoted string};
/$RE{delimited}{-delim=>'/'}/ and print q{a /.../ sequence};
/$RE{balanced}{-parens=>'()'}/ and print q{balanced parentheses};
/$RE{profanity}/ and print q{a #*@%-ing word};
}
# SUBROUTINE-BASED INTERFACE
use Regexp::Common 'RE_ALL';
while (<>) {
$_ =~ RE_num_real() and print q{a number};
$_ =~ RE_quoted() and print q{a ['"`] quoted string};
$_ =~ RE_delimited(-delim=>'/') and print q{a /.../ sequence};
$_ =~ RE_balanced(-parens=>'()'} and print q{balanced parentheses};
$_ =~ RE_profanity() and print q{a #*@%-ing word};
}
# IN-LINE MATCHING...
if ( $RE{num}{int}->matches($text) ) {...}
# ...AND SUBSTITUTION
my $cropped = $RE{ws}{crop}->subs($uncropped);
# ROLL-YOUR-OWN PATTERNS
use Regexp::Common 'pattern';
pattern name => ['name', 'mine'],
create => '(?i:J[.]?\s+A[.]?\s+Perl-Hacker)',
;
my $name_matcher = $RE{name}{mine};
pattern name => [ 'lineof', '-char=_' ],
create => sub {
my $flags = shift;
my $char = quotemeta $flags->{-char};
return '(?:^$char+$)';
},
match => sub {
my ($self, $str) = @_;
return $str !~ /[^$self->{flags}{-char}]/;
},
subs => sub {
my ($self, $str, $replacement) = @_;
$_[1] =~ s/^$self->{flags}{-char}+$//g;
},
;
my $asterisks = $RE{lineof}{-char=>'*'};
# DECIDING WHICH PATTERNS TO LOAD.
use Regexp::Common qw /comment number/; # Comment and number patterns.
use Regexp::Common qw /no_defaults/; # Don't load any patterns.
use Regexp::Common qw /!delimited/; # All, but delimited patterns.
DESCRIPTION¶
By default, this module exports a single hash (%RE) that stores or generates
commonly needed regular expressions (see "List of available
patterns").
There is an alternative, subroutine-based syntax described in
"Subroutine-based interface".
General syntax for requesting patterns¶
To access a particular pattern, %RE is treated as a hierarchical hash of hashes
(of hashes...), with each successive key being an identifier. For example, to
access the pattern that matches real numbers, you specify:
$RE{num}{real}
and to access the pattern that matches integers:
$RE{num}{int}
Deeper layers of the hash are used to specify
flags: arguments that
modify the resulting pattern in some way. The keys used to access these layers
are prefixed with a minus sign and may have a value; if a value is given, it's
done by using a multidimensional key. For example, to access the pattern that
matches base-2 real numbers with embedded commas separating groups of three
digits (e.g. 10,101,110.110101101):
$RE{num}{real}{-base => 2}{-sep => ','}{-group => 3}
Through the magic of Perl, these flag layers may be specified in any order (and
even interspersed through the identifier keys!) so you could get the same
pattern with:
$RE{num}{real}{-sep => ','}{-group => 3}{-base => 2}
or:
$RE{num}{-base => 2}{real}{-group => 3}{-sep => ','}
or even:
$RE{-base => 2}{-group => 3}{-sep => ','}{num}{real}
etc.
Note, however, that the relative order of amongst the identifier keys
is
significant. That is:
$RE{list}{set}
would not be the same as:
$RE{set}{list}
Flag syntax¶
In versions prior to 2.113, flags could also be written as
"{"-flag=value"}". This no longer works, although
"{"-flag$;value"}" still does. However, "{-flag =>
'value'}" is the preferred syntax.
Universal flags¶
Normally, flags are specific to a single pattern. However, there is two flags
that all patterns may specify.
- "-keep"
- By default, the patterns provided by %RE contain no
capturing parentheses. However, if the "-keep" flag is specified
(it requires no value) then any significant substrings that the pattern
matches are captured. For example:
if ($str =~ $RE{num}{real}{-keep}) {
$number = $1;
$whole = $3;
$decimals = $5;
}
Special care is needed if a "kept" pattern is interpolated into a
larger regular expression, as the presence of other capturing parentheses
is likely to change the "number variables" into which
significant substrings are saved.
See also "Adding new regular expressions", which describes how to
create new patterns with "optional" capturing brackets that
respond to "-keep".
- "-i"
- Some patterns or subpatterns only match lowercase or
uppercase letters. If one wants the do case insensitive matching, one
option is to use the "/i" regexp modifier, or the special
sequence "(?i)". But if the functional interface is used, one
does not have this option. The "-i" switch solves this problem;
by using it, the pattern will do case insensitive matching.
OO interface and inline matching/substitution¶
The patterns returned from %RE are objects, so rather than writing:
if ($str =~ /$RE{some}{pattern}/ ) {...}
you can write:
if ( $RE{some}{pattern}->matches($str) ) {...}
For matching this would seem to have no great advantage apart from readability
(but see below).
For substitutions, it has other significant benefits. Frequently you want to
perform a substitution on a string without changing the original. Most people
use this:
$changed = $original;
$changed =~ s/$RE{some}{pattern}/$replacement/;
The more adept use:
($changed = $original) =~ s/$RE{some}{pattern}/$replacement/;
Regexp::Common allows you do write this:
$changed = $RE{some}{pattern}->subs($original=>$replacement);
Apart from reducing precedence-angst, this approach has the added advantages
that the substitution behaviour can be optimized from the regular expression,
and the replacement string can be provided by default (see "Adding new
regular expressions").
For example, in the implementation of this substitution:
$cropped = $RE{ws}{crop}->subs($uncropped);
the default empty string is provided automatically, and the substitution is
optimized to use:
$uncropped =~ s/^\s+//;
$uncropped =~ s/\s+$//;
rather than:
$uncropped =~ s/^\s+|\s+$//g;
Subroutine-based interface¶
The hash-based interface was chosen because it allows regexes to be effortlessly
interpolated, and because it also allows them to be "curried". For
example:
my $num = $RE{num}{int};
my $commad = $num->{-sep=>','}{-group=>3};
my $duodecimal = $num->{-base=>12};
However, the use of tied hashes does make the access to Regexp::Common patterns
slower than it might otherwise be. In contexts where impatience overrules
laziness, Regexp::Common provides an additional subroutine-based interface.
For each (sub-)entry in the %RE hash ($RE{key1}{key2}{etc}), there is a
corresponding exportable subroutine: "RE_key1_key2_etc()". The name
of each subroutine is the underscore-separated concatenation of the
non-flag keys that locate the same pattern in %RE. Flags are passed to
the subroutine in its argument list. Thus:
use Regexp::Common qw( RE_ws_crop RE_num_real RE_profanity );
$str =~ RE_ws_crop() and die "Surrounded by whitespace";
$str =~ RE_num_real(-base=>8, -sep=>" ") or next;
$offensive = RE_profanity(-keep);
$str =~ s/$offensive/$bad{$1}++; "<expletive deleted>"/ge;
Note that, unlike the hash-based interface (which returns objects), these
subroutines return ordinary "qr"'d regular expressions. Hence they
do not curry, nor do they provide the OO match and substitution inlining
described in the previous section.
It is also possible to export subroutines for all available patterns like so:
use Regexp::Common 'RE_ALL';
Or you can export all subroutines with a common prefix of keys like so:
use Regexp::Common 'RE_num_ALL';
which will export "RE_num_int" and "RE_num_real" (and if you
have create more patterns who have first key
num, those will be
exported as well). In general,
RE_key1_..._keyn_ALL will export all
subroutines whose pattern names have first keys
key1 ...
keyn.
Adding new regular expressions¶
You can add your own regular expressions to the %RE hash at run-time, using the
exportable "pattern" subroutine. It expects a hash-like list of
key/value pairs that specify the behaviour of the pattern. The various
possible argument pairs are:
- "name => [ @list ]"
- A required argument that specifies the name of the pattern,
and any flags it may take, via a reference to a list of strings. For
example:
pattern name => [qw( line of -char )],
# other args here
;
This specifies an entry $RE{line}{of}, which may take a "-char"
flag.
Flags may also be specified with a default value, which is then used
whenever the flag is specified without an explicit value (but not when the
flag is omitted). For example:
pattern name => [qw( line of -char=_ )],
# default char is '_'
# other args here
;
- "create => $sub_ref_or_string"
- A required argument that specifies either a string that is
to be returned as the pattern:
pattern name => [qw( line of underscores )],
create => q/(?:^_+$)/
;
or a reference to a subroutine that will be called to create the pattern:
pattern name => [qw( line of -char=_ )],
create => sub {
my ($self, $flags) = @_;
my $char = quotemeta $flags->{-char};
return '(?:^$char+$)';
},
;
If the subroutine version is used, the subroutine will be called with three
arguments: a reference to the pattern object itself, a reference to a hash
containing the flags and their values, and a reference to an array
containing the non-flag keys.
Whatever the subroutine returns is stringified as the pattern.
No matter how the pattern is created, it is immediately postprocessed to
include or exclude capturing parentheses (according to the value of the
"-keep" flag). To specify such "optional" capturing
parentheses within the regular expression associated with
"create", use the notation "(?k:...)". Any parentheses
of this type will be converted to "(...)" when the
"-keep" flag is specified, or "(?:...)" when it is
not. It is a Regexp::Common convention that the outermost capturing
parentheses always capture the entire pattern, but this is not
enforced.
- "match => $sub_ref"
- An optional argument that specifies a subroutine that is to
be called when the "$RE{...}->matches(...)" method of this
pattern is invoked.
The subroutine should expect two arguments: a reference to the pattern
object itself, and the string to be matched against.
It should return the same types of values as a "m/.../" does.
pattern name => [qw( line of -char )],
create => sub {...},
match => sub {
my ($self, $str) = @_;
$str !~ /[^$self->{flags}{-char}]/;
},
;
- "subs => $sub_ref"
- An optional argument that specifies a subroutine that is to
be called when the "$RE{...}->subs(...)" method of this
pattern is invoked.
The subroutine should expect three arguments: a reference to the pattern
object itself, the string to be changed, and the value to be substituted
into it. The third argument may be "undef", indicating the
default substitution is required.
The subroutine should return the same types of values as an
"s/.../.../" does.
For example:
pattern name => [ 'lineof', '-char=_' ],
create => sub {...},
subs => sub {
my ($self, $str, $ignore_replacement) = @_;
$_[1] =~ s/^$self->{flags}{-char}+$//g;
},
;
Note that such a subroutine will almost always need to modify $_[1]
directly.
- "version => $minimum_perl_version"
- If this argument is given, it specifies the minimum version
of perl required to use the new pattern. Attempts to use the pattern with
earlier versions of perl will generate a fatal diagnostic.
Loading specific sets of patterns.¶
By default, all the sets of patterns listed below are made available. However,
it is possible to indicate which sets of patterns should be made available -
the wanted sets should be given as arguments to "use".
Alternatively, it is also possible to indicate which sets of patterns should
not be made available - those sets will be given as argument to the
"use" statement, but are preceded with an exclaimation mark. The
argument
no_defaults indicates none of the default patterns should be
made available. This is useful for instance if all you want is the
"pattern()" subroutine.
Examples:
use Regexp::Common qw /comment number/; # Comment and number patterns.
use Regexp::Common qw /no_defaults/; # Don't load any patterns.
use Regexp::Common qw /!delimited/; # All, but delimited patterns.
It's also possible to load your own set of patterns. If you have a module
"Regexp::Common::my_patterns" that makes patterns available, you can
have it made available with
use Regexp::Common qw /my_patterns/;
Note that the default patterns will still be made available - only if you use
no_defaults, or mention one of the default sets explicitly, the non
mentioned defaults aren't made available.
List of available patterns¶
The patterns listed below are currently available. Each set of patterns has its
own manual page describing the details. For each pattern set named
name, the manual page
Regexp::Common::name describes the
details.
Currently available are:
- Regexp::Common::balanced
- Provides regexes for strings with balanced parenthesized
delimiters.
- Regexp::Common::comment
- Provides regexes for comments of various languages (43
languages currently).
- Regexp::Common::delimited
- Provides regexes for delimited strings.
- Regexp::Common::lingua
- Provides regexes for palindromes.
- Regexp::Common::list
- Provides regexes for lists.
- Regexp::Common::net
- Provides regexes for IPv4 addresses and MAC addresses.
- Regexp::Common::number
- Provides regexes for numbers (integers and reals).
- Regexp::Common::profanity
- Provides regexes for profanity.
- Regexp::Common::whitespace
- Provides regexes for leading and trailing whitespace.
- Regexp::Common::zip
- Provides regexes for zip codes.
Forthcoming patterns and features¶
Future releases of the module will also provide patterns for the following:
* email addresses
* HTML/XML tags
* more numerical matchers,
* mail headers (including multiline ones),
* more URLS
* telephone numbers of various countries
* currency (universal 3 letter format, Latin-1, currency names)
* dates
* binary formats (e.g. UUencoded, MIMEd)
If you have other patterns or pattern generators that you think would be
generally useful, please send them to the maintainer -- preferably as source
code using the "pattern" subroutine. Submissions that include a set
of tests will be especially welcome.
DIAGNOSTICS¶
- "Can't export unknown subroutine %s"
- The subroutine-based interface didn't recognize the
requested subroutine. Often caused by a spelling mistake or an
incompletely specified name.
- "Can't create unknown regex: $RE{...}"
- Regexp::Common doesn't have a generator for the requested
pattern. Often indicates a misspelt or missing parameter.
- "Perl %f does not support the pattern $RE{...}. You
need Perl %f or later"
- The requested pattern requires advanced regex features
(e.g. recursion) that not available in your version of Perl. Time to
upgrade.
- "pattern() requires argument: name => [ @list
]"
- Every user-defined pattern specification must have a
name.
- "pattern() requires argument: create =>
$sub_ref_or_string"
- Every user-defined pattern specification must provide a
pattern creation mechanism: either a pattern string or a reference to a
subroutine that returns the pattern string.
- "Base must be between 1 and 36"
- The $RE{num}{real}{-base=>'N'} pattern uses the
characters [0-9A-Z] to represent the digits of various bases. Hence it
only produces regular expressions for bases up to hexatricensimal.
- "Must specify delimiter in $RE{delimited}"
- The pattern has no default delimiter. You need to write:
$RE{delimited}{-delim=> X'} for some character X
ACKNOWLEDGEMENTS¶
Deepest thanks to the many people who have encouraged and contributed to this
project, especially: Elijah, Jarkko, Tom, Nat, Ed, and Vivek.
Further thanks go to: Alexandr Ciornii, Blair Zajac, Bob Stockdale, Charles
Thomas, Chris Vertonghen, the CPAN Testers, David Hand, Fany, Geoffrey Leach,
Hermann-Marcus Behrens, Jerome Quelin, Jim Cromie, Lars Wilke, Linda Julien,
Mike Arms, Mike Castle, Mikko, Murat Uenalan, Rafaeel Garcia-Suarez, Ron
Savage, Sam Vilain, Slaven Rezic, Smylers, Tim Maher, and all the others I've
forgotten.
AUTHOR¶
Damian Conway (damian@conway.org)
MAINTAINANCE¶
This package is maintained by Abigail (
regexp-common@abigail.be).
BUGS AND IRRITATIONS¶
Bound to be plenty.
For a start, there are many common regexes missing. Send them in to
regexp-common@abigail.be.
There are some POD issues when installing this module using a pre-5.6.0 perl;
some manual pages may not install, or may not install correctly using a perl
that is that old. You might consider upgrading your perl.
NOT A BUG¶
- •
- The various patterns are not anchored. That is, a pattern
like "$RE {num} {int}" will match against "abc4def",
because a substring of the subject matches. This is by design, and not a
bug. If you want the pattern to be anchored, use something like:
my $integer = $RE {num} {int};
$subj =~ /^$integer$/ and print "Matches!\n";
LICENSE and COPYRIGHT¶
This software is Copyright (c) 2001 - 2011, Damian Conway and Abigail.
This module is free software, and maybe used under any of the following
licenses:
1) The Perl Artistic License. See the file COPYRIGHT.AL.
2) The Perl Artistic License 2.0. See the file COPYRIGHT.AL2.
3) The BSD Licence. See the file COPYRIGHT.BSD.
4) The MIT Licence. See the file COPYRIGHT.MIT.