.\" PatMaN DNA pattern matcher
.\" (C) 2007 Kay Pruefer, Udo Stenzel
.\"
.\" This program is free software; you can redistribute it and/or modify
.\" it under the terms of the GNU General Public License as published by
.\" the Free Software Foundation; either version 2 of the License, or (at
.\" your option) any later version. See the LICENSE file for details.
.\" Process this file with
.\" groff -man -Tascii patman.1
.\"
.TH PATMAN 1 "JANUARY 2008" Applications "User Manuals"
.SH NAME
PatMaN \- search for approximate patterns in DNA libraries
.SH SYNOPSIS
.B patman [
.I option
.B |
.I file
.B ... ]
.SH DESCRIPTION
.B PatMaN
searches for (small)
.I patterns
in (huge) DNA
.IR databases ,
allowing for some mismatches and optionally gaps.
.I Patterns
and
.I databases
are read from one or more
.BR fasta (5)
files listed as non-option arguments, depending on whether the
.I -D
or
.I -P
option last preceded them, and matched against each other.
The output of
.B PatMaN
is a table containing one line for each match, consisting of
tab-separated fields:
.IP \(bu 4
name of database sequence,
.IP \(bu 4
name of pattern,
.IP \(bu 4
position of first matched base in database sequence (the sequence's
beginning has position 1),
.IP \(bu 4
position of last matched base in database sequence,
.IP \(bu 4
strand (+ for literal match, - for reverse complement),
.IP \(bu 4
edit distance (number of mismatches plus number of gaps).
.SH OPTIONS
.IP "-V, --version"
Print version number and exit.
.IP "-e num, --edits num"
Allow up to
.I num
mismatches and/or gaps per match.
.IP "-g num, --gaps num"
Allow up to
.I num
gaps per match.
Note that gaps count as mismatches, too, so the
.I -e
option should always be set at least as high as the
.I -g
option.
Allowing many gaps can incur a considerable computational cost.
.IP "-D, --databases"
Treat the following files as
.IR databases .
Databases must be in
.BR fasta (5)
format.
Multiple
.I database
files, including "-" for standard input, are allowed and are read in turn.
.IP "-P, --patterns"
Treat the following files as
.IR patterns .
Pattern files must be in
.BR fasta (5)
format.
Multiple
.I pattern
files, including "-" for standard input, are allowed and are all read
before touching the
.IR databases .
.IP "-o file, --output file"
Redirect output to
.IR file .
The file name "-" causes output to be written to stdout, which is also
the default.
.IP "-a, --ambicodes"
Activate the interpretation of ambiguity codes in patterns.
This results in the expansion of any
.I pattern
with ambiguity codes into multiple patterns which can match independently.
Compare
.B Unknown Nucleotides
below.
.IP "-s, --singlestrand"
Deactivate matching of reverse-complements.
Normally,
.B PatMaN
will try to match patterns both literally and after
reverse-complementing them; with this option set, only literal (forward)
matches are considered.
.IP "-p num, --prefetch num"
Causes
.I num
pointers to be prefetched in advance.
This feature can improve performance if
.B PatMaN
has been compiled for a processor architecture that supports prefetching.
The optimum value for your particular setup has to be determined
empirically, but the default should be reasonably good.
.IP "-l len, --min-length len"
Only consider patterns with a length of at least
.IR len .
Use this if your
.I pattern
collection contains short sequences for which you do not want large
numbers of possible matches reported.
.IP "-x num, --chop3 num"
Cut off
.I num
bases from the 3' end of each
.IR pattern .
Use this for
.I patterns
whose 3' ends are damaged, edited, or otherwise unreliable and should be
ignored.
The chopped bases are neither matched nor included in the reported match
regions.
.IP "-X num, --chop5 num"
Cut off
.I num
bases from the 5' end of each
.IR pattern .
Use this for
.I patterns
whose 5' ends are damaged, edited, or otherwise unreliable and should be
ignored.
The chopped bases are neither matched nor included in the reported match
regions.
.IP "-A, --adenine-hack"
Allow adenine to be ignored in patterns.
This is essentially equivalent to not counting gaps in the
.IR database ,
as long as it was an A that was gapped.
Using
.I -A
can be computationally extremely expensive in terms of both memory and
time consumed.
.IP "-q, --quiet"
Suppress warnings (about unrecognized characters in input sequences or
missing input files).
Even without
.IR -q ,
at most one such warning is given per run.
.IP "-v, --verbose"
Print additional progress information to stderr.
.IP "-d flags, --debug flags"
Set debugging flags to
.IR flags .
Flags may be the logical
.I OR
of any of the following values, each of which causes some output to
appear on
.IR stderr .
Some of the values may only work if
.B PatMaN
has been compiled in debug mode.
The default value is 1.
.IP 1
Print warnings.
Equivalent to not setting
.IR -q .
.IP 2
Print progress information.
Equivalent to setting
.IR -v .
.IP 4
Dump the suffix trie of the
.IR patterns .
Only available in debug build.
.IP 8
Count number of visited nodes and print that number in each iteration.
Only available in debug build.
.IP 16
Print total number of nodes fetched from memory after completing all
.IR databases .
.IP 32
Output
.I database
sequence while it is being matched.
.SH NOTES
.SS Non-Option Arguments
Non-option arguments (bare filenames) are treated as either
.I database
or
.I pattern
files, depending on whether the
.I -D
or
.I -P
option was the last that occurred before the filename.
If neither
.I -D
nor
.I -P
was given, file names are treated as
.I pattern
files.
If no
.I database
was given, it is instead read from standard input.
Standard input can be explicitly given as either a
.I database
or a
.I pattern
file by using the filename "-".
A warning is given if standard input is selected implicitly as
.IR database ;
an error message is given if no
.I pattern
files have been named at all.
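.PP
As a sketch of how these rules apply (the file names
.I genome.fa
and
.I probes.fa
are hypothetical), the first invocation below should read
.I genome.fa
as a
.I database
and both
.I probes.fa
and standard input as
.IR patterns ,
while the second should treat
.I probes.fa
as a
.I pattern
file and read the
.I database
implicitly from standard input, with a warning:
.PP
.RS
.nf
patman -e 1 -D genome.fa -P probes.fa -
patman -e 1 probes.fa < genome.fa
.fi
.RE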
.SS Gapped Matching
Allowing gaps often causes overlapping matches of single
.I patterns
at almost the same position.
.B PatMaN
makes no attempt to filter these redundant matches.
Also note that allowing many gaps, and especially allowing an arbitrary
number of gaps through the
.I -A
hack, can slow down
.B PatMaN
considerably and cause it to produce enormous amounts of output.
The use of some sort of post-processor to filter these is highly
recommended.
.SS Unknown Nucleotides
Unknown nucleotides are most often encoded by the letter
.BR N .
If the
.I --ambicodes
option is not given, Ns in patterns are interpreted as unknown
nucleotides and can never match without penalty.
If
.I --ambicodes
is given, Ns in
.I patterns
are expanded just like the other ambiguity codes, and effectively work
as wildcards.
Unknown nucleotides can still be encoded by an
.B X
and will never match anything.
The database is treated differently in that anything other than
.IR A ", " C ", " G ", " T " and " U ,
including ambiguity codes, is treated as unknown and can never match
without penalty.
.SH FILES
.I /etc/popt
.RS
The system wide configuration file for
.BR popt (3).
.B PatMaN
identifies itself as "patman" to popt.
.RE
.I ~/.popt
.RS
Per user configuration file for
.BR popt (3).
.RE
.SH BUGS
None known.
.SH AUTHOR
Kay Pruefer
.br
Udo Stenzel
.SH "SEE ALSO"
.BR popt (3),
.BR fasta (5)