OBIGREP(1) | OBITools | OBIGREP(1) |
NAME¶
obigrep - selects a subset of sequence records from a sequence file
The obigrep command is somewhat analogous to the standard Unix grep command. It selects a subset of sequence records from a sequence file.
A sequence record is a complex object composed of an identifier, a set of attributes (key=value), a definition, and the sequence itself.
Instead of working line by line as the standard Unix tool does, obigrep performs the selection sequence record by sequence record. A large set of options allows the selection to be refined on any element of the sequence record.
Moreover, obigrep allows several conditions to be specified simultaneously (each taking the value TRUE or FALSE); only the sequence records that fulfill all of the conditions are selected.
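The record-by-record selection described above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the SeqRecord class, the select function and the sample records are hypothetical and are not part of the OBITools code.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class SeqRecord:
    """Hypothetical stand-in for a sequence record (not the OBITools class)."""
    identifier: str
    sequence: str
    definition: str = ""
    attributes: Dict[str, object] = field(default_factory=dict)


Condition = Callable[[SeqRecord], bool]


def select(records: Iterable[SeqRecord], conditions: List[Condition]) -> Iterator[SeqRecord]:
    """Yield only the records for which every condition is TRUE."""
    for record in records:
        if all(condition(record) for condition in conditions):
            yield record


# Made-up records used only for this illustration.
records = [
    SeqRecord("GH001", "acgtgaattcag", "chloroplast trnL", {"taxid": 3702}),
    SeqRecord("GH002", "acgtacgtacgt", "nuclear ITS", {"taxid": 9606}),
    SeqRecord("XY003", "ttgaattcaa", "chloroplast rbcL"),
]

conditions = [
    lambda r: r.identifier.startswith("GH"),  # condition on the identifier
    lambda r: "taxid" in r.attributes,        # condition on an attribute
    lambda r: "gaattc" in r.sequence,         # condition on the sequence
]

selected_ids = [r.identifier for r in select(records, conditions)]
print(selected_ids)  # ['GH001']
```

Only GH001 satisfies the three conditions at once; each of the other two records fails one of them and is therefore dropped.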
SEQUENCE RECORD SELECTION OPTIONS¶
Examples:
> obigrep -s 'GAATTC' seq1.fasta > seq2.fasta
Selects only the sequence records that contain an EcoRI restriction site.
> obigrep -s 'A{10,}' seq1.fasta > seq2.fasta
Selects only the sequence records that contain a run of at least 10 consecutive A nucleotides.
> obigrep -s '^[ACGT]+$' seq1.fasta > seq2.fasta
Selects only the sequence records that do not contain ambiguous nucleotides.
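The three patterns used in the examples above can be tried outside obigrep with the Python re module. This is a sketch only: the sample sequences are made up, and case-insensitive matching is an assumption of the sketch, not a statement about how obigrep matches.

```python
import re

# Patterns copied from the obigrep examples above.
# re.IGNORECASE is an assumption of this sketch.
ECORI = re.compile(r"GAATTC", re.IGNORECASE)
POLY_A = re.compile(r"A{10,}", re.IGNORECASE)
UNAMBIGUOUS = re.compile(r"^[ACGT]+$", re.IGNORECASE)

# Made-up sequences used only for this illustration.
sequences = {
    "with_site": "ccgaattcgg",
    "poly_a": "cg" + "a" * 12 + "tt",
    "ambiguous": "acgtnacgt",
}


def matching(pattern):
    """Return the sorted names of the sequences in which the pattern is found."""
    return sorted(name for name, seq in sequences.items() if pattern.search(seq))


print(matching(ECORI))        # ['with_site']
print(matching(POLY_A))       # ['poly_a']
print(matching(UNAMBIGUOUS))  # ['poly_a', 'with_site']
```

The last pattern is anchored with ^ and $, so a single character outside A, C, G and T (here the n) is enough to reject the whole sequence.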
Example:
> obigrep -D '[Cc]hloroplast' seq1.fasta > seq2.fasta
Selects only the sequence records whose definition contains chloroplast or Chloroplast.
Example:
> obigrep -I '^GH' seq1.fasta > seq2.fasta
Selects only the sequence records whose identifier begins with GH.
Example:
> obigrep --id-list=my_id_list.txt seq1.fasta > seq2.fasta
Selects only the sequence records whose identifier is present in the my_id_list.txt file.
Example:
> obigrep -a 'family_name:Asteraceae' seq1.fasta > seq2.fasta
Selects the sequence records containing an attribute whose key is family_name and value is Asteraceae.
Example:
> obigrep -A taxid seq1.fasta > seq2.fasta
Selects only the sequence records having a taxid attribute defined.
Example:
> obigrep -p '(forward_error<2) and (reverse_error<2)' \
seq1.fasta > seq2.fasta
Selects only the sequence records whose forward_error and reverse_error attributes have a value smaller than two.
Example:
> obigrep -L 100 seq1.fasta > seq2.fasta
Selects only the sequence records whose sequence length is shorter than or equal to 100 bp.
Example:
> obigrep -l 100 seq1.fasta > seq2.fasta
Selects only the sequence records whose sequence length is longer than or equal to 100 bp.
Example:
> obigrep -v -l 100 seq1.fasta > seq2.fasta
Selects only the sequence records whose sequence length is shorter than 100 bp.
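The relation between the three length examples above can be checked with a small Python sketch. It assumes, as those examples suggest, that -l keeps lengths greater than or equal to the threshold, that -L keeps lengths smaller than or equal to it, and that -v inverts the selection. The helper names are hypothetical.

```python
def longer_or_equal(min_length):
    """Condition modelled on the -l example: length >= min_length."""
    return lambda seq: len(seq) >= min_length


def shorter_or_equal(max_length):
    """Condition modelled on the -L example: length <= max_length."""
    return lambda seq: len(seq) <= max_length


def inverted(condition):
    """Condition modelled on the -v example: keep what the condition rejects."""
    return lambda seq: not condition(seq)


# Made-up sequences of 99, 100 and 101 nucleotides.
sequences = ["a" * 99, "a" * 100, "a" * 101]

kept_l = [len(s) for s in sequences if longer_or_equal(100)(s)]
kept_L = [len(s) for s in sequences if shorter_or_equal(100)(s)]
kept_v_l = [len(s) for s in sequences if inverted(longer_or_equal(100))(s)]

print(kept_l)    # [100, 101]
print(kept_L)    # [99, 100]
print(kept_v_l)  # [99]
```

Inverting "at least 100" gives "strictly shorter than 100", which is why the 100 bp sequence is kept by both length conditions but not by the inverted one.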
TAXONOMY RELATED OPTIONS¶
- -d <FILENAME>, --database=<FILENAME>
- ecoPCR taxonomy Database name
- -t <FILENAME>, --taxonomy-dump=<FILENAME>
- NCBI Taxonomy dump repository name
- --require-rank=<RANK_NAME>
- Selects the sequence records whose taxid attribute refers to a taxon that has a parent of rank <RANK_NAME>.
- -r <TAXID>, --required=<TAXID>
- required taxid
- -i <TAXID>, --ignore=<TAXID>
- ignored taxid
OPTIONS TO SPECIFY INPUT FORMAT¶
Restrict the analysis to a sub-part of the input file¶
- --skip <N>
- The first N sequence records of the file are discarded from the analysis and are not reported to the output file.
- --only <N>
- Only the next N sequence records of the file are analyzed. The sequences that follow them in the file are neither analyzed nor reported to the output file. This option can be used together with the --skip option.
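How the two options above combine can be sketched in Python with itertools.islice. The sub_part function and its parameters are hypothetical names chosen for this illustration, not OBITools code.

```python
from itertools import islice


def sub_part(records, skip=0, only=None):
    """Drop the first `skip` records, then keep at most `only` records."""
    stop = None if only is None else skip + only
    return islice(records, skip, stop)


# Ten made-up record identifiers: seq1 ... seq10.
records = [f"seq{i}" for i in range(1, 11)]

window = list(sub_part(records, skip=3, only=4))
tail = list(sub_part(records, skip=8))

print(window)  # ['seq4', 'seq5', 'seq6', 'seq7']
print(tail)    # ['seq9', 'seq10']
```

With both values set, the records before the window are skipped and the ones after it are never read, which matches the description of the two options.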
Sequence annotated format¶
- --genbank
- Input file is in GenBank format.
- --embl
- Input file is in EMBL format.
fasta related format¶
- --fasta
- Input file is in fasta format (including OBITools fasta extensions).
fastq related format¶
- --sanger
- Input file is in Sanger fastq format (standard fastq used by HiSeq/MiSeq sequencers).
- --solexa
- Input file is in fastq format produced by Solexa (Ga IIx) sequencers.
ecoPCR related format¶
- --ecopcr
- Input file is in ecoPCR format.
- --ecopcrdb
- Input is an ecoPCR database.
Specifying the sequence type¶
- --nuc
- Input file contains nucleic sequences.
- --prot
- Input file contains protein sequences.
OPTIONS TO SPECIFY OUTPUT FORMAT¶
Standard output format¶
- --fasta-output
- Output sequences in OBITools fasta format
- --fastq-output
- Output sequences in Sanger fastq format
Generating an ecoPCR database¶
- --ecopcrdb-output=<PREFIX_FILENAME>
- Creates an ecoPCR database from sequence records results
Miscellaneous option¶
- --uppercase
- Print sequences in upper case (default is lower case)
COMMON OPTIONS¶
- -h, --help
- Shows this help message and exits.
- --DEBUG
- Sets logging in debug mode.
AUTHOR¶
The OBITools Development Team - LECA
COPYRIGHT¶
2019 - 2015, OBITools Development Team
July 27, 2019 | 1.02 13 |