NAME
Search::QueryParser - parses a query string into a data structure suitable for
external search engines
SYNOPSIS
use Search::QueryParser;
my $qp = Search::QueryParser->new;
my $s = '+mandatoryWord -excludedWord +field:word "exact phrase"';
my $query = $qp->parse($s) or die "Error in query : " . $qp->err;
$someIndexer->search($query);
# query with comparison operators and implicit plus (second arg is true)
$query = $qp->parse("txt~'^foo.*' date>='01.01.2001' date<='02.02.2002'", 1);
# boolean operators (example below is equivalent to "+a +(b c) -d")
$query = $qp->parse("a AND (b OR c) AND NOT d");
# subset of rows
$query = $qp->parse("Id#123,444,555,666 AND (b OR c)");
DESCRIPTION
This module parses a query string into a data structure to be handled by
external search engines. For examples of such engines, see File::Tabular and
Search::Indexer.
The query string can contain simple terms, "exact phrases", field
names and comparison operators, '+/-' prefixes, parentheses, and boolean
connectors.
The parser can be parameterized by regular expressions for specific notions of
"term", "field name" or "operator"; see the new method. The parser has no
support for lemmatization or other term transformations: these should be done
externally, before passing the query data structure to the search engine.
The data structure resulting from a parsed query is a tree of terms and
operators, as described below in the parse method. The interpretation of the
structure is up to the external search engine that receives the parsed query;
the present module makes no assumption about what it means to be "equal" to
or to "contain" a term.
QUERY STRING
The query string is decomposed into "items", where each item has an
optional sign prefix, an optional field name and comparison operator, and a
mandatory value.
Sign prefix
Prefix '+' means that the item is mandatory. Prefix '-' means that the item must
be excluded. No prefix means that the item will be searched for, but is not
mandatory.
As far as the result set is concerned, "+a +b c" is strictly equivalent to
"+a +b": the search engine will return documents containing both terms 'a'
and 'b', and possibly also term 'c'. However, if the search engine also
returns relevance scores, query "+a +b c" might give a better score to
documents that also contain term 'c'.
See also section "Boolean connectors" below, which is another way to
combine items into a query.
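To illustrate how a search engine might apply the three sign groups, here is a minimal sketch. The helper names (matches, score), the representation of a document as a plain list of words, and the rule that a query without mandatory terms needs at least one optional term are all assumptions made for this example; none of them is part of Search::QueryParser.

```perl
use strict;
use warnings;

# Hypothetical mini engine: a document is a list of words, and a query is
# a hash of word lists keyed by sign prefix ('+', '' or '-').

# True (1) if the document belongs to the result set, false (0) otherwise.
sub matches {
    my ($query, @words) = @_;
    my %has   = map { $_ => 1 } @words;
    my @plus  = @{ $query->{'+'} || [] };
    my @plain = @{ $query->{''}  || [] };
    my @minus = @{ $query->{'-'} || [] };

    return 0 if grep { $has{$_} } @minus;    # an excluded term is present
    return 0 if grep { !$has{$_} } @plus;    # a mandatory term is missing
    return 1 if @plus;                       # all mandatory terms were found
    # Assumption: without mandatory terms, one optional term must match.
    return scalar(grep { $has{$_} } @plain) ? 1 : 0;
}

# Naive relevance score: number of mandatory or optional terms found.
sub score {
    my ($query, @words) = @_;
    my %has   = map { $_ => 1 } @words;
    my @terms = map { @{ $query->{$_} || [] } } ('+', '');
    my @found = grep { $has{$_} } @terms;
    return scalar @found;
}

my $q1 = { '+' => ['a', 'b'] };                 # "+a +b"
my $q2 = { '+' => ['a', 'b'], '' => ['c'] };    # "+a +b c"

my @doc1 = qw(a b);
my @doc2 = qw(a b c);

for my $doc (\@doc1, \@doc2) {
    printf "q1: match=%d score=%d   q2: match=%d score=%d\n",
        matches($q1, @$doc), score($q1, @$doc),
        matches($q2, @$doc), score($q2, @$doc);
}
```

Both queries select the same documents in this sketch; only the score of the second document differs, because it also contains the optional term.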
Field name and comparison operator
Internally, each query item has a field name and comparison operator; if not
written explicitly in the query, these take default values '' (empty field
name) and ':' (colon operator).
Operators have a left operand (the field name) and a right operand (the value to
be compared with); for example, "foo:bar" means "search
documents containing term 'bar' in field 'foo'", whereas
"foo=bar" means "search documents where field 'foo' has exact
value 'bar'".
Here is the list of accepted operators with their intended meanings:
- ":"
- treat value as a term to be searched within field. This is
the default operator.
- "~" or "=~"
- treat value as a regex; match field against the regex.
- "!~"
- negation of the above
- "==" or "=", "<=", ">=", "!=", "<", ">"
- classical relational operators
- "#"
- Inclusion in the set of comma-separated integers supplied
on the right-hand side.
Operators ":", "~", "=~", "!~" and
"#" admit an empty left operand (so the field name will be '').
Search engines will usually interpret this as "any field" or
"the whole data record".
Value
A value (right operand to a comparison operator) can be
- just a term (as recognized by regex "rxTerm", see the new method below)
- a quoted phrase, i.e. a collection of terms within single or double quotes.
Quotes can be used not only for "exact phrases", but also to prevent
misinterpretation of some values: for example, an unquoted -2 would mean
"value '2' with prefix '-'", in other words "exclude term '2'", so if you
want to search for value -2, you should write it within quotes, as "-2". In
the date example of the synopsis, quotes were used to prevent splitting of
dates into several search terms.
- a subquery within parentheses. Field names and operators distribute over
parentheses, so for example "foo:(bar bie)" is equivalent to
"foo:bar foo:bie". Nested field names such as "foo:(bar:bie)" are not
allowed. Sign prefixes do not distribute: "+(foo bar) +bie" is not
equivalent to "+foo +bar +bie".
Boolean connectors
Queries can contain boolean connectors 'AND', 'OR', 'NOT' (or their
equivalents in some other languages). This is mere syntactic sugar for the '+'
and '-' prefixes: "a AND b" is translated into "+a +b"; "a OR b" is
translated into "(a b)"; "NOT a" is translated into "-a". "+a OR b" does not
make sense, but it is translated into "(a b)", under the assumption that the
user understands "OR" better than a '+' prefix. "-a OR b" does not make sense
either, but has no meaningful approximation, so it is rejected.
Combinations of AND/OR clauses must be surrounded by parentheses, i.e.
"(a AND b) OR c" or "a AND (b OR c)" are allowed, but "a AND b OR c" is not.
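The rules above can be tried out with a short script such as the following sketch. It only uses the new, parse, unparse and err methods described in this document; the list of sample queries and the %accepted hash are illustrative choices. The module is loaded inside eval so that the script still runs (and simply skips the parsing part) where Search::QueryParser is not installed.

```perl
use strict;
use warnings;

# Load the module only if it is available.
my $have_qp = eval { require Search::QueryParser; 1 };

my @queries = (
    'a AND (b OR c)',    # parenthesized mix of AND/OR
    '(a AND b) OR c',    # parenthesized mix of AND/OR
    'a AND b OR c',      # unparenthesized mix: should be rejected
    '-a OR b',           # excluded operand of OR: should be rejected
);

my %accepted;    # query string => 1 if parsed, 0 if rejected
if ($have_qp) {
    my $qp = Search::QueryParser->new;
    for my $s (@queries) {
        my $query = $qp->parse($s);
        if (defined $query) {
            $accepted{$s} = 1;
            print "parsed   : $s  =>  ", $qp->unparse($query), "\n";
        }
        else {
            $accepted{$s} = 0;
            print "rejected : ", $qp->err, "\n";
        }
    }
}
else {
    print "Search::QueryParser is not installed; nothing to parse\n";
}
```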
METHODS
- new
-
new(rxTerm => qr/.../, rxOp => qr/.../, ...)
Creates a new query parser, initialized with (optional) regular expressions:
- rxTerm
- Regular expression for matching a term. It must not match the empty
string. Default value is "qr/[^\s()]+/". A term should not be allowed to
include parentheses, otherwise the parser might get into trouble.
- rxField
- Regular expression for matching a field name. Default value
is "qr/\w+/" (meaning of "\w" according to "use
locale").
- rxOp
- Regular expression for matching an operator. Default value
is "qr/==|<=|>=|!=|=~|!~|:|=|<|>|~/". Note that the
longest operators come first in the regex, because "alternatives are
tried from left to right" (see "Version 8 Regular
Expressions" in perlre): this avoids "a<=3" being
parsed as "a < '=3'".
- rxOpNoField
- Regular expression for the subset of operators which
admit an empty left operand (no field name). Default value is
"qr/=~|!~|~|:/". Such operators can be meaningful for
comparisons with "any field" or with "the whole
record"; the precise interpretation depends on the search
engine.
- rxAnd
- Regular expression for boolean connector AND. Default value
is "qr/AND|ET|UND|E/".
- rxOr
- Regular expression for boolean connector OR. Default value
is "qr/OR|OU|ODER|O/".
- rxNot
- Regular expression for boolean connector NOT. Default value
is "qr/NOT|PAS|NICHT|NON/".
- defField
- If no field is specified in the query, use defField.
The default is the empty string "".
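The remark about operator order in rxOp can be checked with plain Perl regexes. The sketch below compares the documented default with a deliberately mis-ordered variant; split_item is a hypothetical, much simplified stand-in for the parser's "field, operator, value" step, not the module's actual code.

```perl
use strict;
use warnings;

# Documented default (longest operators first) and a mis-ordered variant
# in which the one-character operators are tried first.
my $rx_good = qr/==|<=|>=|!=|=~|!~|:|=|<|>|~/;
my $rx_bad  = qr/:|=|<|>|~|==|<=|>=|!=|=~|!~/;

# Split "field op value" using the given operator regex.
# Returns an empty list if the item does not have that shape.
sub split_item {
    my ($item, $rx_op) = @_;
    my ($field, $op, $value) = $item =~ /^(\w+)\s*($rx_op)\s*(.*)$/
        or return;
    return ($field, $op, $value);
}

for my $rx ($rx_good, $rx_bad) {
    my ($field, $op, $value) = split_item('a<=3', $rx);
    print "field=$field op=$op value=$value\n";
}
```

With the mis-ordered regex, the alternative "<" succeeds before "<=" is ever tried, so the "=" ends up as part of the value. The same consideration applies to any custom rxOp passed to new.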
- parse
-
$q = $queryParser->parse($queryString, $implicitPlus);
Returns a data structure corresponding to the parsed string. The second
argument is optional; if true, it adds an implicit '+' in front of each
term without prefix, so parse("+a b c -d", 1) is
equivalent to parse("+a +b +c -d"). This is often
seen in common WWW search engines as a "match all
words" option.
The return value has the following structure:
{ '+' => [{field=>'f1', op=>':', value=>'v1', quote=>'q1'},
{field=>'f2', op=>':', value=>'v2', quote=>'q2'}, ...],
'' => [...],
'-' => [...]
}
In other words, it is a hash ref with 3 keys '+', '' and '-', corresponding
to the 3 sign prefixes (mandatory, ordinary or excluded items). Each key
holds either a ref to an array of items, or "undef" (no items
with this prefix in the query).
An item is a hash ref containing
- "field"
- scalar, field name (may be the empty string)
- "op"
- scalar, operator
- "quote"
- scalar, character that was used for quoting the value
('"', "'" or undef)
- "value"
- Either a scalar (simple term), or a recursive ref to another query
structure. In the latter case, "op" is necessarily '()'; this corresponds
to a subquery in parentheses.
In case of a parsing error, "parse" returns "undef"; method
err can be called to get an explanatory message.
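A search engine typically consumes this structure with a recursive walk. The sketch below works on a hand-written structure that follows the shape documented above (it is not actual parser output), and the describe function is a hypothetical helper that prints one line per item, indented by nesting depth.

```perl
use strict;
use warnings;

# Hand-written example following the documented shape: a hash ref keyed by
# sign prefix, each key holding an array ref of items.
my $query = {
    '+' => [ { field => 'title', op => ':', value => 'perl', quote => undef } ],
    '-' => [ { field => '',      op => ':', value => 'draft', quote => undef } ],
    ''  => [
        {   field => '', op => '()', quote => undef,
            value => {    # subquery in parentheses
                '' => [
                    { field => 'year',   op => '>=', value => '2001',    quote => undef },
                    { field => 'author', op => ':',  value => 'L. Dami', quote => '"' },
                ],
            },
        },
    ],
};

# Recursively build one line of text per item.
sub describe {
    my ($q, $depth) = @_;
    $depth ||= 0;
    my $indent = '  ' x $depth;
    my @lines;
    for my $sign ('+', '', '-') {
        for my $item (@{ $q->{$sign} || [] }) {    # missing key: no items
            my $label = $sign eq '' ? ' ' : $sign;
            if (ref $item->{value}) {              # nested query structure
                push @lines, "$indent$label(";
                push @lines, describe($item->{value}, $depth + 1);
                push @lines, "$indent )";
            }
            else {                                 # simple term or phrase
                my $quote = defined $item->{quote} ? $item->{quote} : '';
                push @lines,
                    "$indent$label$item->{field}$item->{op}$quote$item->{value}$quote";
            }
        }
    }
    return @lines;
}

print "$_\n" for describe($query);
```

The "|| []" guard covers sign prefixes for which the query holds no items, and the ref test distinguishes a subquery from a simple value.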
- err
-
$msg = $queryParser->err;
Returns a message describing the last parse error.
- unparse
-
$s = $queryParser->unparse($query);
Returns a string representation of the $query data structure.
AUTHOR
Laurent Dami, <laurent.dami AT etat ge ch>
COPYRIGHT AND LICENSE
Copyright (C) 2005, 2007 by Laurent Dami.
This library is free software; you can redistribute it and/or modify it under
the same terms as Perl itself.