.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.40)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\" ========================================================================
.\"
.IX Title "Approx 3pm"
.TH Approx 3pm "2020-11-09" "perl v5.32.0" "User Contributed Perl Documentation"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
String::Approx \- Perl extension for approximate matching (fuzzy matching)
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\&  use String::Approx \*(Aqamatch\*(Aq;
\&
\&  print if amatch("foobar");
\&
\&  my @matches = amatch("xyzzy", @inputs);
\&
\&  my @catches = amatch("plugh", [\*(Aq2\*(Aq], @inputs);
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
String::Approx lets you match and substitute strings approximately.
With this you can emulate errors: typing errors, spelling errors,
closely related vocabularies (colour color), genetic mutations (\s-1GAG
ACT\s0), abbreviations (McScot, MacScot).
.PP
\&\s-1NOTE:\s0 String::Approx suits the task of \fBstring matching\fR, not 
\&\fBstring comparison\fR, and it works for \fBstrings\fR, not for \fBtext\fR.
.PP
If you want to compare strings for similarity, you probably just want
the Levenshtein edit distance (explained below), the Text::Levenshtein
and Text::LevenshteinXS modules in \s-1CPAN.\s0  See also Text::WagnerFischer
and Text::PhraseDistance.  (There are functions for this in String::Approx,
e.g. \fBadist()\fR, but their results sometimes differ from the bare Levenshtein
et al.)
.PP
If you want to compare things like text or source code, consisting of
\&\fBwords\fR or \fBtokens\fR and \fBphrases\fR and \fBsentences\fR, or
\&\fBexpressions\fR and \fBstatements\fR, you should probably use some other
tool than String::Approx, like for example the standard \s-1UNIX\s0 \fBdiff\fR\|(1)
tool, or the Algorithm::Diff module from \s-1CPAN.\s0
.PP
The measure of \fBapproximateness\fR is the \fILevenshtein edit distance\fR.
It is the total number of \*(L"edits\*(R": insertions,
.PP
.Vb 1
\&        word world
.Ve
.PP
deletions,
.PP
.Vb 1
\&        monkey money
.Ve
.PP
and substitutions
.PP
.Vb 1
\&        sun fun
.Ve
.PP
required to transform a string to another string.  For example, to
transform \fI\*(L"lead\*(R"\fR into \fI\*(L"gold\*(R"\fR, you need three edits:
.PP
.Vb 1
\&        lead gead goad gold
.Ve
.PP
The edit distance of \*(L"lead\*(R" and \*(L"gold\*(R" is therefore three, or 75%.
.PP
\&\fBString::Approx\fR uses the Levenshtein edit distance as its measure, but
String::Approx is not well-suited for comparing strings of different
length, in other words, if you want a \*(L"fuzzy eq\*(R", see above.
String::Approx is more like regular expressions or \fBindex()\fR, it finds
substrings that are close matches.>
.SH "MATCH"
.IX Header "MATCH"
.Vb 1
\&        use String::Approx \*(Aqamatch\*(Aq;
\&
\&        $matched     = amatch("pattern") 
\&        $matched     = amatch("pattern", [ modifiers ])
\&
\&        $any_matched = amatch("pattern", @inputs) 
\&        $any_matched = amatch("pattern", [ modifiers ], @inputs)
\&
\&        @match       = amatch("pattern") 
\&        @match       = amatch("pattern", [ modifiers ])
\&
\&        @matches     = amatch("pattern", @inputs) 
\&        @matches     = amatch("pattern", [ modifiers ], @inputs)
.Ve
.PP
Match \fBpattern\fR approximately.  In list context return the matched
\&\fB\f(CB@inputs\fB\fR.  If no inputs are given, match against the \fB\f(CB$_\fB\fR.  In scalar
context return true if \fIany\fR of the inputs match, false if none match.
.PP
Notice that the pattern is a string.  Not a regular expression.  None
of the regular expression notations (^, ., *, and so on) work.  They
are characters just like the others.  Note-on-note: some limited form
of \fI\*(L"regular expressionism\*(R"\fR is planned in future: for example
character classes ([abc]) and \fIany-chars\fR (.).  But that feature will
be turned on by a special \fImodifier\fR (just a guess: \*(L"r\*(R"), so there
should be no backward compatibility problem.
.PP
Notice also that matching is not symmetric.  The inputs are matched
against the pattern, not the other way round.  In other words: the
pattern can be a substring, a submatch, of an input element.  An input
element is always a superstring of the pattern.
.SS "\s-1MODIFIERS\s0"
.IX Subsection "MODIFIERS"
With the modifiers you can control the amount of approximateness and
certain other control variables.  The modifiers are one or more
strings, for example \fB\*(L"i\*(R"\fR, within a string optionally separated by
whitespace.  The modifiers are inside an anonymous array: the \fB[ ]\fR
in the syntax are not notational, they really do mean \fB[ ]\fR, for
example \fB[ \*(L"i\*(R", \*(L"2\*(R" ]\fR.  \fB[\*(L"2 i\*(R"]\fR would be identical.
.PP
The implicit default approximateness is 10%, rounded up.  In other
words: every tenth character in the pattern may be an error, an edit.
You can explicitly set the maximum approximateness by supplying a
modifier like
.PP
.Vb 2
\&        number
\&        number%
.Ve
.PP
Examples: \fB\*(L"3\*(R"\fR, \fB\*(L"15%\*(R"\fR.
.PP
Note that \f(CW\*(C`0%\*(C'\fR is not rounded up, it is equal to \f(CW0\fR.
.PP
Using a similar syntax you can separately control the maximum number
of insertions, deletions, and substitutions by prefixing the numbers
with I, D, or S, like this:
.PP
.Vb 6
\&        Inumber
\&        Inumber%
\&        Dnumber
\&        Dnumber%
\&        Snumber
\&        Snumber%
.Ve
.PP
Examples: \fB\*(L"I2\*(R"\fR, \fB\*(L"D20%\*(R"\fR, \fB\*(L"S0\*(R"\fR.
.PP
You can ignore case (\fB\*(L"A\*(R"\fR becames equal to \fB\*(L"a\*(R"\fR and vice versa)
by adding the \fB\*(L"i\*(R"\fR modifier.
.PP
For example
.PP
.Vb 1
\&        [ "i 25%", "S0" ]
.Ve
.PP
means \fIignore case\fR, \fIallow every fourth character to be \*(L"an edit\*(R"\fR,
but allow \fIno substitutions\fR.  (See \s-1NOTES\s0 about disallowing
substitutions or insertions.)
.PP
\&\s-1NOTE:\s0 setting \f(CW\*(C`I0 D0 S0\*(C'\fR is not equivalent to using \fBindex()\fR.
If you want to use \fBindex()\fR, use \fBindex()\fR.
.SH "SUBSTITUTE"
.IX Header "SUBSTITUTE"
.Vb 1
\&        use String::Approx \*(Aqasubstitute\*(Aq;
\&
\&        @substituted = asubstitute("pattern", "replacement")
\&        @substituted = asubstitute("pattern", "replacement", @inputs) 
\&        @substituted = asubstitute("pattern", "replacement", [ modifiers ])
\&        @substituted = asubstitute("pattern", "replacement",
\&                                   [ modifiers ], @inputs)
.Ve
.PP
Substitute approximate \fBpattern\fR with \fBreplacement\fR and return as a
list <copies> of \fB\f(CB@inputs\fB\fR, the substitutions having been made on the
elements that did match the pattern.  If no inputs are given,
substitute in the \fB\f(CB$_\fB\fR.  The replacement can contain magic strings
\&\fB$&\fR, \fB$`\fR, \fB$'\fR that stand for the matched string, the string
before it, and the string after it, respectively.  All the other
arguments are as in \f(CW\*(C`amatch()\*(C'\fR, plus one additional modifier, \fB\*(L"g\*(R"\fR
which means substitute globally (all the matches in an element and not
just the first one, as is the default).
.PP
See \*(L"\s-1BAD NEWS\*(R"\s0 about the unfortunate stinginess of \f(CW\*(C`asubstitute()\*(C'\fR.
.SH "INDEX"
.IX Header "INDEX"
.Vb 1
\&        use String::Approx \*(Aqaindex\*(Aq;
\&
\&        $index   = aindex("pattern")
\&        @indices = aindex("pattern", @inputs)
\&        $index   = aindex("pattern", [ modifiers ])
\&        @indices = aindex("pattern", [ modifiers ], @inputs)
.Ve
.PP
Like \f(CW\*(C`amatch()\*(C'\fR but returns the index/indices at which the pattern
matches approximately.  In list context and if \f(CW@inputs\fR are used,
returns a list of indices, one index for each input element.
If there's no approximate match, \f(CW\*(C`\-1\*(C'\fR is returned as the index.
.PP
\&\s-1NOTE:\s0 if there is character repetition (e.g. \*(L"aa\*(R") either in
the pattern or in the text, the returned index might start 
\&\*(L"too early\*(R".  This is consistent with the goal of the module
of matching \*(L"as early as possible\*(R", just like regular expressions
(that there might be a \*(L"less approximate\*(R" match starting later is
of somewhat irrelevant).
.PP
There's also backwards-scanning \f(CW\*(C`arindex()\*(C'\fR.
.SH "SLICE"
.IX Header "SLICE"
.Vb 1
\&        use String::Approx \*(Aqaslice\*(Aq;
\&
\&        ($index, $size)   = aslice("pattern")
\&        ([$i0, $s0], ...) = aslice("pattern", @inputs)
\&        ($index, $size)   = aslice("pattern", [ modifiers ])
\&        ([$i0, $s0], ...) = aslice("pattern", [ modifiers ], @inputs)
.Ve
.PP
Like \f(CW\*(C`aindex()\*(C'\fR but returns also the size (length) of the match.
If the match fails, returns an empty list (when matching against \f(CW$_\fR)
or an empty anonymous list corresponding to the particular input.
.PP
\&\s-1NOTE:\s0 size of the match will very probably be something you did not
expect (such as longer than the pattern, or a negative number).  This
may or may not be fixed in future releases. Also the beginning of the
match may vary from the expected as with \fBaindex()\fR, see above.
.PP
If the modifier
.PP
.Vb 1
\&        "minimal_distance"
.Ve
.PP
is used, the minimal possible edit distance is returned as the
third element:
.PP
.Vb 2
\&        ($index, $size, $distance) = aslice("pattern", [ modifiers ])
\&        ([$i0, $s0, $d0], ...)     = aslice("pattern", [ modifiers ], @inputs)
.Ve
.SH "DISTANCE"
.IX Header "DISTANCE"
.Vb 1
\&        use String::Approx \*(Aqadist\*(Aq;
\&
\&        $dist = adist("pattern", $input);
\&        @dist = adist("pattern", @input);
.Ve
.PP
Return the \fIedit distance\fR or distances between the pattern and the
input or inputs.  Zero edit distance means exact match.  (Remember
that the match can 'float' in the inputs, the match is a substring
match.)  If the pattern is longer than the input or inputs, the
returned distance or distances is or are negative.
.PP
.Vb 1
\&        use String::Approx \*(Aqadistr\*(Aq;
\&
\&        $dist = adistr("pattern", $input);
\&        @dist = adistr("pattern", @inputs);
.Ve
.PP
Return the \fBrelative\fR \fIedit distance\fR or distances between the
pattern and the input or inputs.  Zero relative edit distance means
exact match, one means completely different.  (Remember that the
match can 'float' in the inputs, the match is a substring match.)  If
the pattern is longer than the input or inputs, the returned distance
or distances is or are negative.
.PP
You can use \fBadist()\fR or \fBadistr()\fR to sort the inputs according to their
approximateness:
.PP
.Vb 3
\&        my %d;
\&        @d{@inputs} = map { abs } adistr("pattern", @inputs);
\&        my @d = sort { $d{$a} <=> $d{$b} } @inputs;
.Ve
.PP
Now \f(CW@d\fR contains the inputs, the most like \f(CW"pattern"\fR first.
.SH "CONTROLLING THE CACHE"
.IX Header "CONTROLLING THE CACHE"
\&\f(CW\*(C`String::Approx\*(C'\fR maintains a \s-1LU\s0 (least-used) cache that holds the
\&'matching engines' for each instance of a \fIpattern+modifiers\fR.  The
cache is intended to help the case where you match a small set of
patterns against a large set of string.  However, the more engines you
cache the more you eat memory.  If you have a lot of different
patterns or if you have a lot of memory to burn, you may want to
control the cache yourself.  For example, allowing a larger cache
consumes more memory but probably runs a little bit faster since the
cache fills (and needs flushing) less often.
.PP
The cache has two parameters: \fImax\fR and \fIpurge\fR.  The first one
is the maximum size of the cache and the second one is the cache
flushing ratio: when the number of cache entries exceeds \fImax\fR,
\&\fImax\fR times \fIpurge\fR cache entries are flushed.  The default
values are 1000 and 0.75, respectively, which means that when
the 1001st entry would be cached, 750 least used entries will
be removed from the cache.  To access the parameters you can
use the calls
.PP
.Vb 2
\&        $now_max = String::Approx::cache_max();
\&        String::Approx::cache_max($new_max);
\&
\&        $now_purge = String::Approx::cache_purge();
\&        String::Approx::cache_purge($new_purge);
\&
\&        $limit = String::Approx::cache_n_purge();
.Ve
.PP
To be honest, there are actually \fBtwo\fR caches: the first one is used
far the patterns with no modifiers, the second one for the patterns
with pattern modifiers.  Using the standard parameters you will
therefore actually cache up to 2000 entries.  The above calls control
both caches for the same price.
.PP
To disable caching completely use
.PP
.Vb 1
\&        String::Approx::cache_disable();
.Ve
.PP
Note that this doesn't flush any possibly existing cache entries,
to do that use
.PP
.Vb 1
\&        String::Approx::cache_flush_all();
.Ve
.SH "NOTES"
.IX Header "NOTES"
Because matching is by \fIsubstrings\fR, not by whole strings, insertions
and substitutions produce often very similar results: \*(L"abcde\*(R" matches
\&\*(L"axbcde\*(R" either by insertion \fBor\fR substitution of \*(L"x\*(R".
.PP
The maximum edit distance is also the maximum number of edits.
That is, the \fB\*(L"I2\*(R"\fR in
.PP
.Vb 1
\&        amatch("abcd", ["I2"])
.Ve
.PP
is useless because the maximum edit distance is (implicitly) 1.
You may have meant to say
.PP
.Vb 1
\&        amatch("abcd", ["2D1S1"])
.Ve
.PP
or something like that.
.PP
If you want to simulate transposes
.PP
.Vb 1
\&        feet fete
.Ve
.PP
you need to allow at least edit distance of two because in terms of
our edit primitives a transpose is first one deletion and then one
insertion.
.SS "\s-1TEXT POSITION\s0"
.IX Subsection "TEXT POSITION"
The starting and ending positions of matching, substituting, indexing, or
slicing can be changed from the beginning and end of the input(s) to
some other positions by using either or both of the modifiers
.PP
.Vb 2
\&        "initial_position=24"
\&        "final_position=42"
.Ve
.PP
or the both the modifiers
.PP
.Vb 2
\&        "initial_position=24"
\&        "position_range=10"
.Ve
.PP
By setting the \fB\*(L"position_range\*(R"\fR to be zero you can limit
(anchor) the operation to happen only once (if a match is possible)
at the position.
.SH "VERSION"
.IX Header "VERSION"
Major release 3.
.SH "CHANGES FROM VERSION 2"
.IX Header "CHANGES FROM VERSION 2"
.SS "\s-1GOOD NEWS\s0"
.IX Subsection "GOOD NEWS"
.IP "The version 3 is 2\-3 times faster than version 2" 4
.IX Item "The version 3 is 2-3 times faster than version 2"
.PD 0
.IP "No pattern length limitation" 4
.IX Item "No pattern length limitation"
.PD
The algorithm is independent on the pattern length: its time
complexity is \fIO(kn)\fR, where \fIk\fR is the number of edits and \fIn\fR the
length of the text (input).  The preprocessing of the pattern will of
course take some \fIO(m)\fR (\fIm\fR being the pattern length) time, but
\&\f(CW\*(C`amatch()\*(C'\fR and \f(CW\*(C`asubstitute()\*(C'\fR cache the result of this
preprocessing so that it is done only once per pattern.
.SS "\s-1BAD NEWS\s0"
.IX Subsection "BAD NEWS"
.IP "You do need a C compiler to install the module" 4
.IX Item "You do need a C compiler to install the module"
Perl's regular expressions are no more used; instead a faster and more
scalable algorithm written in C is used.
.ie n .IP """asubstitute()"" is now always stingy" 4
.el .IP "\f(CWasubstitute()\fR is now always stingy" 4
.IX Item "asubstitute() is now always stingy"
The string matched and substituted is now always stingy, as short
as possible.  It used to be as long as possible.  This is an unfortunate
change stemming from switching the matching algorithm.  Example: with
edit distance of two and substituting for \fB\*(L"word\*(R"\fR from \fB\*(L"cork\*(R"\fR and
\&\fB\*(L"wool\*(R"\fR previously did match \fB\*(L"cork\*(R"\fR and \fB\*(L"wool\*(R"\fR.  Now it does
match \fB\*(L"or\*(R"\fR and \fB\*(L"wo\*(R"\fR.  As little as possible, or, in other words,
with as much approximateness, as many edits, as possible.  Because
there is no \fIneed\fR to match the \fB\*(L"c\*(R"\fR of \fB\*(L"cork\*(R"\fR, it is not matched.
.ie n .IP "no more ""aregex()"" because regular expressions are no more used" 4
.el .IP "no more \f(CWaregex()\fR because regular expressions are no more used" 4
.IX Item "no more aregex() because regular expressions are no more used"
.PD 0
.ie n .IP "no more ""compat1"" for String::Approx version 1 compatibility" 4
.el .IP "no more \f(CWcompat1\fR for String::Approx version 1 compatibility" 4
.IX Item "no more compat1 for String::Approx version 1 compatibility"
.PD
.SH "ACKNOWLEDGEMENTS"
.IX Header "ACKNOWLEDGEMENTS"
The following people have provided valuable test cases, documentation
clarifications, and other feedback:
.PP
Jared August, Arthur Bergman, Anirvan Chatterjee, Steve A. Chervitz,
Aldo Calpini, David Curiel, Teun van den Dool, Alberto Fontaneda,
Rob Fugina, Dmitrij Frishman, Lars Gregersen, Kevin Greiner,
B. Elijah Griffin, Mike Hanafey, Mitch Helle, Ricky Houghton,
\&'idallen', Helmut Jarausch, Damian Keefe, Ben Kennedy, Craig Kelley,
Franz Kirsch, Dag Kristian, Mark Land, J. D. Laub, John P. Linderman,
Tim Maher, Juha Muilu, Sergey Novoselov, Andy Oram, Ji Y Park,
Eric Promislow, Nikolaus Rath, Stefan Ram, Slaven Rezic,
Dag Kristian Rognlien, Stewart Russell, Slaven Rezic, Chris Rosin,
Pasha Sadri, Ilya Sandler, Bob J.A. Schijvenaars, Ross Smith,
Frank Tobin, Greg Ward, Rich Williams, Rick Wise.
.PP
The matching algorithm was developed by Udi Manber, Sun Wu, and Burra
Gopal in the Department of Computer Science, University of Arizona.
.SH "AUTHOR"
.IX Header "AUTHOR"
Jarkko Hietaniemi <jhi@iki.fi>
.SH "COPYRIGHT AND LICENSE"
.IX Header "COPYRIGHT AND LICENSE"
Copyright 2001\-2013 by Jarkko Hietaniemi
.PP
This library is free software; you can redistribute it and/or modify
under either the terms of the Artistic License 2.0, or the \s-1GNU\s0 Library
General Public License, Version 2.  See the files Artistic and \s-1LGPL\s0
for more details.
.PP
Furthermore: no warranties or obligations of any kind are given, and
the separate file \fI\s-1COPYRIGHT\s0\fR must be included intact in all copies
and derived materials.