DNACLUST(1) | Bioinformatics Tools | DNACLUST(1) |
NAME¶
dnaclust - program to cluster large number of short DNA sequencesSYNOPSIS¶
dnaclust
{-i | --input} infile
[{-s | --similarity} threshold]
[{-m | --multiple-alignment}]
[{-d | --header}]
[{-l | --left-gaps-allowed}] [{
-k | --k-mer-length} length]
[{-a | --approximate-filter}]
[--no-k-mer-filter]
dnaclust [{-h | --help} |
{-v | --version}]
DESCRIPTION¶
This manual page documents briefly the dnaclust program. dnaclust is a tool for clustering large number of short DNA sequences. The clusters are created in such a way that the "radius" of each clusters is no more than the specified threshold. The input sequences to be clustered should be in Fasta format. The id of each sequence is based on the first word of the seqeunce in the Fasta format. The first word is the prefix of the header up to the first occurance of white space characters in the header. The output is written to STDOUT. If you want the output to be written to a file, just redirect the output (See Examples). The output has two modes: the default clustering mode, and clustering with multiple sequence alignment. In the clustering mode (without multiple alignment), each cluster will be printed on a separate line. The line will contain the ids of the sequences in the cluster. The first id in each line is the cluster center sequence id. Because of the way our clusters are constructed, the length of the cluster center sequence is always greater than or equal to the length of any of the sequences in the cluster. Please note that since usually some clusters contain many sequences, the lines of the output may be very long. If you want to visually inspect the output, please use the 'less -S', or an editor that does not wrap long lines. The number of clusters can be found using 'wc -l'. For more information about the multiple sequence alignment mode see the description of --multiple-alignment option.OPTIONS¶
The program follows the usual GNU command line syntax, with long options starting with two dashes ('-'). A summary of options is included below. --similarity threshold, -s thresholdThe similarity threshold specifies the radius of the
clusters created. This parameter is a floating point number between 0 and 1.
It is calculated based on semi-global alignment of a sequence to the cluster
center sequence. Namely similarity = 1 - (edit distance) / (length of the
shorter sequence). The edit distance is the minimum number of insertions,
deletions or substitutions necessary to aling a sequence to the cluster center
sequence. Our algorithms are faster when the similarity is higher.
--k-mer-length length, -k length
When you use the k-mer filter (which is enabled by
default) you can specify the maximum length of the k-mers used for filtering.
The longer k-mer lengths require more memory to store k-mer counts and the
filtering will be slower. However with the longer k-mer length, the filter
will be more specific and therefore the sequence alignment search may be
faster.
There is a tradeoff between filtering and searching time. If you do not specify
the k-mer length a value of log4(median of the lengths of the input sequences)
is picked automatically. By using this option you can override the default
value.
Keep in mind, however, that longer k-mer lengths would require more memory to
store the filtering data structures.
--approximate-filter , -a
By default the k-mer filter is 100 percent sensitive.
This means that in the output clustering, no two cluster centers are within
the threshold distance from each other. The exact filter, however, is somewhat
slow. This option speeds up the filter by using a heuristic. The use of the
approximate filter may result in cluster centers that are close, and a larger
number clusters overall. However the approximate filter is usually several
times faster than the exact sensitive filter. Use this option if you are
clustering primarily to reduce the redundancy in the data, and do not care
about the quality of the clustering.
--allow-left-gaps , -l
With this option the distances are measured based on
semi-global alignment. The semi-global alignment allows for gaps without
penalty on both ends of the shorter sequence.
The default alignment is a one sided semi-global alignment. i.e. gaps are only
allowed at the right end of the shorter sequence without penalty. This
behavior corresponds to the data from targeted sequening of a region (e.g. of
16S ribosomal RNA gene).
--multiple-alignment, -m
Set the output format to show the multiple sequence
alignment of each cluster. The gaps in the alignments are represented with the
dash '-' charachter.
The format of the MSA output is as follows: The MSA of each cluster spans
several lines. The MSA starts with a line containing charachter '#' followed
by the number of sequences in that cluster. The aligned sequences (which may
contain gaps) follow in the Fasta format. Each Fasta record will be composed
of two lines. The header line and the sequence line. Since each aligned
sequence is output on a single line, the output may contain very long lines.
Please use 'less -S', or an editor that does not wrap long lines for
inspecting the MSA also.
--no-k-mer-filter
Disables the k-mer filter. Suitable for clustering very
short sequences at a high similarity threshold.
-d, --header
Write program options to output.
-h, --help
Show summary of options.
-v, --version
Show version of program.
EXAMPLES¶
./dnaclust file.fasta -l -s 0.98 -k 3 > clusters
BUGS¶
The program is currently limited to only work with the boost library.SEE ALSO¶
awk(1), less(1), wc(1)AUTHOR¶
Mohammadreza Ghodsi <ghodsi@cs.umd.edu>Designed and developed and maintains this package. Wrote
this manpage.
COPYRIGHT¶
Copyright © 2010 Mohammadreza Ghodsi02/01/2014 | dnaclust |