.\" Man page generated from reStructuredText.
.
.TH "ECOTAG" "1" "Jul 27, 2019" " 1.02 13" "OBITools"
.SH NAME
ecotag \- assigns sequences to a taxon based on sequence similarity
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.sp
\fI\%ecotag\fP is the tool that assigns sequences to a taxon based on 
sequence similarity. The program first searches the reference database for the 
reference sequence(s) (hereafter referred to as ‘primary reference sequence(s)’) showing the 
highest similarity with the query sequence. Then it looks for all other reference 
sequences (hereafter referred to as ‘secondary reference sequences’) whose 
similarity with the primary reference sequence(s) is equal to or higher than the
similarity between the primary reference and the query sequences. Finally, it 
assigns the query sequence to the most recent common ancestor of the primary and 
secondary reference sequences.
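.sp
The three steps above can be sketched in Python. This is a minimal
illustration, not the code used by \fIecotag\fP: the taxonomy, the reference
sequences, and the position\-wise \fBidentity\fP function are hypothetical
stand\-ins chosen to keep the example short.
.sp
.nf
.ft C
```python
# Sketch only: toy data and a toy identity measure, not the ecotag code.

# Hypothetical taxonomy: child -> parent (the root is its own parent).
PARENT = {
    "root": "root",
    "Felidae": "root",
    "Felis": "Felidae",
    "Panthera": "Felidae",
    "Felis catus": "Felis",
    "Felis silvestris": "Felis",
    "Panthera leo": "Panthera",
}

# Hypothetical reference database: (sequence, taxon) pairs.
REFERENCES = [
    ("AAAAAAAAAA", "Felis catus"),
    ("AAAAAAAAAT", "Felis silvestris"),
    ("AAAATTTTTT", "Panthera leo"),
]


def identity(a, b):
    """Toy identity: fraction of positions at which a and b agree."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / max(len(a), len(b))


def lineage(taxon):
    """Return the path from a taxon up to the root."""
    path = [taxon]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def common_ancestor(taxa):
    """Return the most recent taxon shared by all lineages."""
    taxa = list(taxa)
    shared = set(lineage(taxa[0]))
    for taxon in taxa[1:]:
        shared &= set(lineage(taxon))
    # Walking up from any taxon, the first shared node is the deepest one.
    return next(node for node in lineage(taxa[0]) if node in shared)


def assign(query, references, minimum_identity=0.0):
    """Assign query to a taxon, or return None below minimum_identity."""
    scores = [identity(query, seq) for seq, _ in references]
    best = max(scores)
    if best < minimum_identity:
        return None
    # Step 1: primary references are the best matches of the query.
    primaries = [
        ref for ref, score in zip(references, scores) if score == best
    ]
    taxa = {taxon for _, taxon in primaries}
    # Step 2: secondary references are at least as close to a primary
    # reference as that primary reference is to the query.
    for primary_seq, _ in primaries:
        for seq, taxon in references:
            if identity(primary_seq, seq) >= best:
                taxa.add(taxon)
    # Step 3: the assignment is the common ancestor of all retained taxa.
    return common_ancestor(taxa)
```
.ft P
.fi
.sp
With these toy data, a query identical to one reference is assigned to the
species of that reference, whereas a query equally distant from the two
\fIFelis\fP references is assigned to the genus.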
.sp
As input, \fIecotag\fP requires the sequences to be assigned, a reference database 
in fasta format, where each sequence is associated with a taxon identified 
by a unique \fItaxid\fP, and a taxonomy database where taxonomic information is stored 
for each \fItaxid\fP\&.
.INDENT 0.0
.INDENT 3.5
\fIExample:\fP
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> ecotag \-d embl_r113 \-R ReferenceDB.fasta \e
  \-\-sort=count \-m 0.95 \-r seq.fasta > seq_tag.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The above command specifies that each sequence stored in \fBseq.fasta\fP 
is compared to those in the reference database called \fBReferenceDB.fasta\fP 
for taxonomic assignment. In the output file \fBseq_tag.fasta\fP, the sequences 
are sorted from highest to lowest counts. When there is no reference sequence 
with a similarity equal to or higher than 0.95 for a given sequence, no taxonomic
information is provided for this sequence in \fBseq_tag.fasta\fP\&.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SH ECOTAG SPECIFIC OPTIONS
.INDENT 0.0
.TP
.B \-R <FILENAME>, \-\-ref\-database=<FILENAME>
<FILENAME> is the fasta file containing the reference sequences
.UNINDENT
.INDENT 0.0
.TP
.B \-m FLOAT, \-\-minimum\-identity=FLOAT
When the best match with the reference database presents an identity
level below FLOAT, the taxonomic assignment for the sequence record
is not computed. The sequence record is nevertheless included in the
output file. FLOAT must lie in the interval [0,1].
.UNINDENT
.INDENT 0.0
.TP
.B \-\-minimum\-circle=FLOAT
Minimum identity considered for the assignment circle.
FLOAT must lie in the interval [0,1].
.UNINDENT
.INDENT 0.0
.TP
.B \-x RANK, \-\-explain=RANK
.UNINDENT
.INDENT 0.0
.TP
.B \-u, \-\-uniq
When this option is specified, the program first dereplicates the sequence
records to work on unique sequences only. This option greatly improves
the program’s speed, especially for highly redundant datasets.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-sort=<KEY>
The output is sorted based on the values of the relevant attribute.
.UNINDENT
.INDENT 0.0
.TP
.B \-r, \-\-reverse
The output is sorted in reverse order (should be used with the \-\-sort option).
The option is accepted even when \-\-sort is not set, but the key on which
the output is then sorted is not documented.
.UNINDENT
.INDENT 0.0
.TP
.B \-E FLOAT, \-\-errors=FLOAT
FLOAT is the fraction of reference sequences that will
be ignored when looking for the lowest common ancestor. This
option is useful when a non\-negligible proportion of reference sequences
is expected to be assigned to the wrong taxon, for example because of
taxonomic misidentification. FLOAT must lie in the interval [0,1].
.UNINDENT
.INDENT 0.0
.TP
.B \-M INTEGER, \-\-min\-matches=INTEGER
Defines the minimum number of congruent assignments. If this minimum is
reached and the \-E option is activated, the lowest common ancestor algorithm
tolerates that some sequences do not provide the same taxonomic annotation
(see the \-E option).
.UNINDENT
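.sp
The interplay between \-E and \-M can be illustrated with a short Python
sketch. It shows one possible reading of an error\-tolerant common ancestor
and is not taken from the \fIecotag\fP source: the taxonomy, the function
\fBtolerant_ancestor\fP, and the way the two thresholds are combined are
assumptions made for illustration.
.sp
.nf
.ft C
```python
# Sketch only: one possible reading of an error-tolerant common ancestor.
from collections import Counter

# Hypothetical toy taxonomy: child -> parent (the root is its own parent).
PARENT = {
    "root": "root",
    "Felidae": "root",
    "Felis": "Felidae",
    "Panthera": "Felidae",
    "Felis catus": "Felis",
    "Felis silvestris": "Felis",
    "Panthera leo": "Panthera",
}


def lineage(taxon):
    """Return the path from a taxon up to the root."""
    path = [taxon]
    while PARENT[path[-1]] != path[-1]:
        path.append(PARENT[path[-1]])
    return path


def tolerant_ancestor(taxa, errors=0.0, min_matches=1):
    """Return the deepest taxon supported by enough reference sequences.

    taxa holds one taxon per reference sequence. Up to
    int(len(taxa) * errors) of them may fall outside the returned taxon,
    provided that at least min_matches of them fall inside it.
    """
    support = Counter()
    for taxon in taxa:
        for node in lineage(taxon):
            support[node] += 1
    tolerated = int(len(taxa) * errors)
    needed = max(len(taxa) - tolerated, min_matches)
    # Never ask for more support than the strict common ancestor offers.
    needed = min(needed, len(taxa))
    candidates = [node for node, count in support.items() if count >= needed]
    # The longest lineage identifies the deepest candidate.
    return max(candidates, key=lambda node: len(lineage(node)))


# Hypothetical taxa of four retained reference sequences.
TAXA = ["Felis catus", "Felis catus", "Felis silvestris", "Panthera leo"]
```
.ft P
.fi
.sp
In this sketch, the strict common ancestor of the four references is the
family. Ignoring a quarter of them (errors=0.25) discards the single
\fIPanthera\fP reference and yields the genus \fIFelis\fP, unless
min_matches is raised to 4, in which case the family is returned again.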
.INDENT 0.0
.TP
.B \-\-cache\-size=INTEGER
A cache for computed similarities is maintained by \fIecotag\fP\&. The default
size of this cache is 1,000,000 scores. This option changes
the cache size.
.UNINDENT
.SH TAXONOMY RELATED OPTIONS
.INDENT 0.0
.TP
.B \-d <FILENAME>, \-\-database=<FILENAME>
ecoPCR taxonomy Database name
.UNINDENT
.INDENT 0.0
.TP
.B \-t <FILENAME>, \-\-taxonomy\-dump=<FILENAME>
NCBI Taxonomy dump repository name
.UNINDENT
.SH OPTIONS TO SPECIFY INPUT FORMAT
.SS Restrict the analysis to a sub\-part of the input file
.INDENT 0.0
.TP
.B \-\-skip <N>
The first N sequence records of the file are discarded from the analysis and
not reported in the output file.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-only <N>
Only the next N sequence records of the file are analyzed. The following sequences
in the file are neither analyzed nor reported in the output file.
This option can be used together with the \fI\-\-skip\fP option.
.UNINDENT
.SS Sequence annotated format
.INDENT 0.0
.TP
.B \-\-genbank
Input file is in genbank format.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-embl
Input file is in embl format.
.UNINDENT
.SS fasta related format
.INDENT 0.0
.TP
.B \-\-fasta
Input file is in fasta format (including
OBITools fasta extensions).
.UNINDENT
.SS fastq related format
.INDENT 0.0
.TP
.B \-\-sanger
Input file is in Sanger fastq format (standard
fastq used by HiSeq/MiSeq sequencers).
.UNINDENT
.INDENT 0.0
.TP
.B \-\-solexa
Input file is in fastq format produced by
Solexa (Ga IIx) sequencers.
.UNINDENT
.SS ecoPCR related format
.INDENT 0.0
.TP
.B \-\-ecopcr
Input file is in ecoPCR format.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-ecopcrdb
Input is an ecoPCR database.
.UNINDENT
.SS Specifying the sequence type
.INDENT 0.0
.TP
.B \-\-nuc
Input file contains nucleic sequences.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-prot
Input file contains protein sequences.
.UNINDENT
.SH OPTIONS TO SPECIFY OUTPUT FORMAT
.SS Standard output format
.INDENT 0.0
.TP
.B \-\-fasta\-output
Output sequences in \fBOBITools\fP fasta format
.UNINDENT
.INDENT 0.0
.TP
.B \-\-fastq\-output
Output sequences in Sanger fastq format
.UNINDENT
.SS Generating an ecoPCR database
.INDENT 0.0
.TP
.B \-\-ecopcrdb\-output=<PREFIX_FILENAME>
Creates an ecoPCR database from the resulting sequence records
.UNINDENT
.SS Miscellaneous option
.INDENT 0.0
.TP
.B \-\-uppercase
Print sequences in upper case (default is lower case)
.UNINDENT
.SH COMMON OPTIONS
.INDENT 0.0
.TP
.B \-h, \-\-help
Shows this help message and exits.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-DEBUG
Sets logging in debug mode.
.UNINDENT
.SH ECOTAG ADDED SEQUENCE ATTRIBUTES
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.INDENT 2.0
.IP \(bu 2
best_identity
.IP \(bu 2
best_match
.IP \(bu 2
family
.IP \(bu 2
family_name
.IP \(bu 2
genus
.UNINDENT
.INDENT 2.0
.IP \(bu 2
genus_name
.IP \(bu 2
id_status
.IP \(bu 2
order
.IP \(bu 2
order_name
.IP \(bu 2
rank
.UNINDENT
.INDENT 2.0
.IP \(bu 2
scientific_name
.IP \(bu 2
species
.IP \(bu 2
species_list
.IP \(bu 2
species_name
.IP \(bu 2
taxid
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SH AUTHOR
The OBITools Development Team - LECA
.SH COPYRIGHT
2015 - 2019, OBITools Development Team
.\" Generated by docutils manpage writer.
.