.\" Man page generated from reStructuredText.
.
.TH "OBIANNOTATE" "1" "Jul 27, 2019" " 1.02 13" "OBITools"
.SH NAME
obiannotate \- adds, modifies or removes annotation attributes of sequence records
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.sp
\fI\%obiannotate\fP adds, modifies, or removes the annotation attributes
attached to sequence records.
.sp
Once added, these attributes can be used by the other OBITools commands for
filtering or for computing statistics.
.sp
\fIExample 1:\fP
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obiannotate \-S short:\(aqlen(sequence)<100\(aq seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The above command adds a boolean attribute named \fIshort\fP indicating whether the sequence is shorter than 100 bp.
.UNINDENT
.UNINDENT
.sp
\fIExample 2:\fP
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obiannotate \-\-seq\-rank seq1.fasta | \e
  obiannotate \-C \-\-set\-identifier \(aq"\(aqFungA\(aq_%05d" % seq_rank\(aq \e
  > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The above command first adds a new attribute whose value is the entry number
of the sequence record in the file. It then clears all the sequence record
attributes and sets the identifier to the string \fIFungA_\fP followed by the
sequence entry number, zero\-padded to 5 digits.
.UNINDENT
.UNINDENT
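.sp
The identifier expression in Example 2 relies on standard Python string
formatting. The following is a minimal sketch run outside of OBITools, with
hypothetical entry numbers standing in for the \fBseq_rank\fP attribute; it
only illustrates how the \fB%05d\fP directive pads the number with zeros.
.sp
.nf
.ft C
```python
# Hypothetical entry numbers standing in for the seq_rank attribute.
ranks = [1, 42, 12345]

# Same formatting expression as in Example 2: zero padded to 5 digits.
identifiers = ["FungA_%05d" % seq_rank for seq_rank in ranks]

for identifier in identifiers:
    print(identifier)
```
.ft P
.fi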
.sp
\fIExample 3:\fP
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obiannotate \-d my_ecopcr_database \e
  \-\-with\-taxon\-at\-rank=genus seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The above command adds taxonomic information at the \fIgenus\fP rank to the 
sequence records.
.UNINDENT
.UNINDENT
.sp
\fIExample 4:\fP
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obiannotate \-S \(aqnew_seq:str(sequence).replace("a","t")\(aq \e
  seq1.fasta | obiannotate \-\-set\-sequence new_seq > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The above command edits the \fIsequence\fP object itself by replacing every
\fIa\fP nucleotide with a \fIt\fP\&. First, a new attribute named
\fInew_seq\fP is created to hold the modified sequence; then the original
sequence is replaced by the modified one.
.UNINDENT
.UNINDENT
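.sp
The expression used in Example 4 is an ordinary Python string operation. As a
rough sketch, assuming a plain string in place of the sequence record (the
value below is hypothetical), it behaves as follows. Note that
\fBreplace\fP is case sensitive, so an upper case \fBA\fP is left untouched.
.sp
.nf
.ft C
```python
# Hypothetical sequence string standing in for the sequence record.
sequence = "acgtacgtaa"

# Same expression as in Example 4: every "a" becomes "t".
new_seq = str(sequence).replace("a", "t")
print(new_seq)

# str.replace is case sensitive: the upper case "A" is not replaced.
mixed_case = "ACGa".replace("a", "t")
print(mixed_case)
```
.ft P
.fi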
.SH SEQUENCE RECORD EDITING OPTIONS
.INDENT 0.0
.TP
.B \-\-seq\-rank
Adds a new attribute named \fBseq_rank\fP to the sequence record indicating
its entry number in the sequence file.
.UNINDENT
.INDENT 0.0
.TP
.B \-R <OLD_NAME>:<NEW_NAME>, \-\-rename\-tag=<OLD_NAME>:<NEW_NAME>
Changes attribute name <OLD_NAME> to <NEW_NAME>. When attribute
named <OLD_NAME> is missing, the sequence record is
skipped and the next one is examined.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-delete\-tag=<KEY>
Deletes the attribute named <KEY>. When this attribute
is missing, the sequence record is skipped and the
next one is examined.
.UNINDENT
.INDENT 0.0
.TP
.B \-S <KEY>:<PYTHON_EXPRESSION>, \-\-set\-tag=<KEY>:<PYTHON_EXPRESSION>
Creates a new attribute with key <KEY> and a
value computed from <PYTHON_EXPRESSION>.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-tag\-list=<FILENAME>
<FILENAME> points to a file containing attribute
names and values to modify for specified sequence records.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-set\-identifier=<PYTHON_EXPRESSION>
Sets sequence record identifier with a value computed
from <PYTHON_EXPRESSION>.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-run=<PYTHON_EXPRESSION>
Runs a Python expression on each selected sequence.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-set\-sequence=<PYTHON_EXPRESSION>
Changes the sequence itself with a value computed from
<PYTHON_EXPRESSION>.
.UNINDENT
.INDENT 0.0
.TP
.B \-T, \-\-set\-definition=<PYTHON_EXPRESSION>
Sets sequence definition with a value computed from
<PYTHON_EXPRESSION>.
.UNINDENT
.INDENT 0.0
.TP
.B \-O, \-\-only\-valid\-python
Allows only valid Python expressions.
.UNINDENT
.INDENT 0.0
.TP
.B \-C, \-\-clear
Clears all attributes associated with the sequence records.
.UNINDENT
.INDENT 0.0
.TP
.B \-k <KEY>, \-\-keep=<KEY>
Keeps only the attribute with key <KEY>. Several \fB\-k\fP
options can be combined.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-length
Adds attribute with \fBseq_length\fP as a key and sequence length as a value.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-with\-taxon\-at\-rank=<RANK_NAME>
Adds taxonomic annotation at taxonomic rank
<RANK_NAME>.
.UNINDENT
.INDENT 0.0
.TP
.B \-m <MCLFILE>, \-\-mcl=<MCLFILE>
Creates a new attribute containing the number of the
cluster the sequence record was assigned to, as
indicated in file <MCLFILE>.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-uniq\-id
Forces sequence record ids to be unique.
.UNINDENT
.SH SEQUENCE RECORD SELECTION OPTIONS
.INDENT 0.0
.TP
.B \-s <REGULAR_PATTERN>, \-\-sequence=<REGULAR_PATTERN>
.INDENT 7.0
.INDENT 3.5
Regular expression pattern to be tested against the
sequence itself. The pattern is case insensitive.
.UNINDENT
.UNINDENT
.sp
\fIExamples:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-s \(aqGAATTC\(aq seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records that contain an \fIEcoRI\fP restriction site.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-s \(aqA{10,}\(aq seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records that contain a stretch of at least 10 \fBA\fP\&.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-s \(aq^[ACGT]+$\(aq seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records that do not contain ambiguous nucleotides.
.UNINDENT
.UNINDENT
.UNINDENT
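.sp
The three \fB\-s\fP patterns shown above can be tried with the Python
\fBre\fP module. This is only a sketch: the sequences are hypothetical, and
the use of \fBre.search\fP with \fBre.IGNORECASE\fP is an assumption chosen to
mirror the documented case insensitive matching, not a description of the
OBITools internals.
.sp
.nf
.ft C
```python
import re

# Hypothetical sequences: an EcoRI site, a run of 11 "a", an ambiguous "n".
sequences = ["ccgaattcgg", "tt" + "a" * 11 + "gg", "acgtnacgt"]


def matches(pattern, sequence):
    # re.IGNORECASE mirrors the documented case insensitive matching.
    return re.search(pattern, sequence, re.IGNORECASE) is not None


ecori = [matches("GAATTC", s) for s in sequences]
poly_a = [matches("A{10,}", s) for s in sequences]
unambiguous = [matches("^[ACGT]+$", s) for s in sequences]

print(ecori, poly_a, unambiguous)
```
.ft P
.fi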
.INDENT 0.0
.TP
.B \-D <REGULAR_PATTERN>, \-\-definition=<REGULAR_PATTERN>
.INDENT 7.0
.INDENT 3.5
Regular expression pattern to be tested against the
definition of the sequence record. The pattern is case
sensitive.
.UNINDENT
.UNINDENT
.sp
\fIExample:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-D \(aq[Cc]hloroplast\(aq seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records whose definition contains \fBchloroplast\fP or
\fBChloroplast\fP\&.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B \-I <REGULAR_PATTERN>, \-\-identifier=<REGULAR_PATTERN>
.INDENT 7.0
.INDENT 3.5
Regular expression pattern to be tested against the
identifier of the sequence record. The pattern is case
sensitive.
.UNINDENT
.UNINDENT
.sp
\fIExample:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-I \(aq^GH\(aq seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records whose identifier begins with \fBGH\fP\&.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B \-\-id\-list=<FILENAME>
.INDENT 7.0
.INDENT 3.5
\fB<FILENAME>\fP points to a text file containing the list of sequence
record identifiers to be selected.
The file format consists in a single identifier per line.
.UNINDENT
.UNINDENT
.sp
\fIExample:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-\-id\-list=my_id_list.txt seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records whose identifier is present in the
\fBmy_id_list.txt\fP file.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B \-a <KEY>:<REGULAR_PATTERN>, \-\-attribute=<KEY>:<REGULAR_PATTERN>
.INDENT 7.0
.INDENT 3.5
Regular expression pattern matched against an attribute of the
sequence record. The option value has the form
\fBkey:regular_pattern\fP\&. The pattern is case sensitive. Several
\fB\-a\fP options can be used on the same command line; in that case,
the selected sequence records must match all constraints.
.UNINDENT
.UNINDENT
.sp
\fIExample:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-a \(aqfamily_name:Asteraceae\(aq seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects the sequence records containing an attribute whose key is \fBfamily_name\fP and value
is \fBAsteraceae\fP\&.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B \-A <KEY>, \-\-has\-attribute=<KEY>
.INDENT 7.0
.INDENT 3.5
Selects sequence records having an attribute whose key is <KEY>.
.UNINDENT
.UNINDENT
.sp
\fIExample:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-A taxid seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records having a \fItaxid\fP attribute defined.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B \-p <PYTHON_EXPRESSION>, \-\-predicat=<PYTHON_EXPRESSION>
.INDENT 7.0
.INDENT 3.5
Python boolean expression to be evaluated for each
sequence record. The attribute keys defined for each sequence record
can be used in the expression as variable names.
An extra variable named \fBsequence\fP refers to the
sequence record itself.
Several \fB\-p\fP options can be used on the same command
line; in that case,
the selected sequence records must match all constraints.
.UNINDENT
.UNINDENT
.sp
\fIExample:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>  obigrep \-p \(aq(forward_error<2) and (reverse_error<2)\(aq \e
   seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records whose \fBforward_error\fP and \fBreverse_error\fP
attributes have a value smaller than two.
.UNINDENT
.UNINDENT
.UNINDENT
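.sp
The way attribute keys act as variable names in a \fB\-p\fP expression can be
pictured with plain Python. The sketch below is an illustration under stated
assumptions: the records are hypothetical dictionaries, and the call to
\fBeval\fP is one possible way to obtain this behaviour, not necessarily the
one used by OBITools.
.sp
.nf
.ft C
```python
# Hypothetical attribute dictionaries standing in for sequence records.
records = [
    {"id": "seq1", "forward_error": 0, "reverse_error": 1},
    {"id": "seq2", "forward_error": 3, "reverse_error": 0},
]

# Same predicate as in the example above.
predicate = "(forward_error<2) and (reverse_error<2)"

# Each attribute key is exposed to the expression as a variable name.
selected = [r["id"] for r in records if eval(predicate, {}, r)]
print(selected)
```
.ft P
.fi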
.INDENT 0.0
.TP
.B \-L <##>, \-\-lmax=<##>
.INDENT 7.0
.INDENT 3.5
Keeps sequence records whose sequence length is
shorter than or equal to \fBlmax\fP\&.
.UNINDENT
.UNINDENT
.sp
\fIExample:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-L 100 seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records that have a sequence
length shorter than or equal to 100 bp.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B \-l <##>, \-\-lmin=<##>
.INDENT 7.0
.INDENT 3.5
Selects sequence records whose sequence length is
longer than or equal to \fBlmin\fP\&.
.UNINDENT
.UNINDENT
.sp
\fIExample:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-l 100 seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records that have a sequence length
longer than or equal to 100 bp.
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.TP
.B \-v, \-\-inverse\-match
.INDENT 7.0
.INDENT 3.5
Inverts the sequence record selection.
.UNINDENT
.UNINDENT
.sp
\fIExample:\fP
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
> obigrep \-v \-l 100 seq1.fasta > seq2.fasta
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Selects only the sequence records that have a sequence length shorter than 100 bp.
.UNINDENT
.UNINDENT
.UNINDENT
.SH TAXONOMY RELATED OPTIONS
.INDENT 0.0
.TP
.B \-d <FILENAME>, \-\-database=<FILENAME>
Name of the ecoPCR taxonomy database.
.UNINDENT
.INDENT 0.0
.TP
.B \-t <FILENAME>, \-\-taxonomy\-dump=<FILENAME>
Name of the NCBI taxonomy dump repository.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-require\-rank=<RANK_NAME>
Selects sequence records whose taxid tag has a parent of
rank <RANK_NAME>.
.UNINDENT
.INDENT 0.0
.TP
.B \-r <TAXID>, \-\-required=<TAXID>
required taxid
.UNINDENT
.INDENT 0.0
.TP
.B \-i <TAXID>, \-\-ignore=<TAXID>
ignored taxid
.UNINDENT
.SH OPTIONS TO SPECIFY INPUT FORMAT
.SS Restrict the analysis to a sub\-part of the input file
.INDENT 0.0
.TP
.B \-\-skip <N>
The first N sequence records of the file are discarded from the analysis and
not reported to the output file.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-only <N>
Only the next N sequence records of the file are analyzed. The following sequences
in the file are neither analyzed nor reported to the output file.
This option can be used together with the \fI\-\-skip\fP option.
.UNINDENT
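.sp
The combined effect of \fB\-\-skip\fP and \fB\-\-only\fP can be pictured as
taking a slice of the input. The following sketch uses hypothetical record
names and assumes that \fB\-\-only\fP counts records after the skipped ones,
as the descriptions above suggest; it does not reproduce the OBITools code.
.sp
.nf
.ft C
```python
from itertools import islice

# Hypothetical record identifiers standing in for an input file.
records = ["r1", "r2", "r3", "r4", "r5", "r6"]

# Hypothetical option values: skip 2 records, then analyze 3.
skip, only = 2, 3

# Discard the first skip records, then keep the next only records.
analyzed = list(islice(records, skip, skip + only))
print(analyzed)
```
.ft P
.fi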
.SS Sequence annotated format
.INDENT 0.0
.TP
.B \-\-genbank
Input file is in GenBank format.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-embl
Input file is in EMBL format.
.UNINDENT
.SS fasta related format
.INDENT 0.0
.TP
.B \-\-fasta
Input file is in fasta format (including
OBITools fasta extensions).
.UNINDENT
.SS fastq related format
.INDENT 0.0
.TP
.B \-\-sanger
Input file is in Sanger fastq format (standard
fastq used by HiSeq/MiSeq sequencers).
.UNINDENT
.INDENT 0.0
.TP
.B \-\-solexa
Input file is in fastq format produced by
Solexa (Ga IIx) sequencers.
.UNINDENT
.SS ecoPCR related format
.INDENT 0.0
.TP
.B \-\-ecopcr
Input file is in ecoPCR format.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-ecopcrdb
Input is an ecoPCR database.
.UNINDENT
.SS Specifying the sequence type
.INDENT 0.0
.TP
.B \-\-nuc
Input file contains nucleic sequences.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-prot
Input file contains protein sequences.
.UNINDENT
.SH OPTIONS TO SPECIFY OUTPUT FORMAT
.SS Standard output format
.INDENT 0.0
.TP
.B \-\-fasta\-output
Output sequences in \fBOBITools\fP fasta format
.UNINDENT
.INDENT 0.0
.TP
.B \-\-fastq\-output
Output sequences in Sanger fastq format
.UNINDENT
.SS Generating an ecoPCR database
.INDENT 0.0
.TP
.B \-\-ecopcrdb\-output=<PREFIX_FILENAME>
Creates an ecoPCR database from the resulting sequence records.
.UNINDENT
.SS Miscellaneous option
.INDENT 0.0
.TP
.B \-\-uppercase
Prints sequences in upper case (default is lower case).
.UNINDENT
.SH COMMON OPTIONS
.INDENT 0.0
.TP
.B \-h, \-\-help
Shows this help message and exits.
.UNINDENT
.INDENT 0.0
.TP
.B \-\-DEBUG
Sets logging in debug mode.
.UNINDENT
.SH OBIANNOTATE ADDED SEQUENCE ATTRIBUTES
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.INDENT 2.0
.IP \(bu 2
seq_length
.IP \(bu 2
seq_rank
.IP \(bu 2
cluster
.IP \(bu 2
scientific_name
.IP \(bu 2
taxid
.UNINDENT
.INDENT 2.0
.IP \(bu 2
rank
.IP \(bu 2
family
.IP \(bu 2
family_name
.IP \(bu 2
genus
.IP \(bu 2
genus_name
.UNINDENT
.INDENT 2.0
.IP \(bu 2
order
.IP \(bu 2
order_name
.IP \(bu 2
species
.IP \(bu 2
species_name
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
.SH AUTHOR
The OBITools Development Team - LECA
.SH COPYRIGHT
2015 - 2019, OBITools Development Team
.\" Generated by docutils manpage writer.
.