GFFREAD(1)

User Commands

NAME

gffread - GFF/GTF utility providing format conversions, region filtering, FASTA sequence extraction

SYNOPSIS

gffread <input_gff> [-g <genomic_seqs_fasta> | <dir>][-s <seq_info.fsize>] [-o <outfile.gff>] [-t <tname>] [-r [[<strand>]<chr>:]<start>..<end> [-R]] [-CTVNJMKQAFPGUBHZWTOLE] [-w <exons.fa>] [-x <cds.fa>] [-y <tr_cds.fa>] [-i <maxintron>] [--sort-by <refseq_list.txt>]

DESCRIPTION

Filter and convert GFF3/GTF2 records, extract the corresponding sequences, etc. By default (i.e. without -O), only transcripts are processed and other features are ignored.
<input_gff> is a GFF file; use '-' for stdin.

OPTIONS

-i: discard transcripts having an intron larger than <maxintron>
-l: discard transcripts shorter than <minlen> bases
-r: only show transcripts overlapping coordinate range <start>..<end> (on chromosome/contig <chr>, strand <strand> if provided)
-R: for -r option, discard all transcripts that are not fully contained within the given range
-U: discard single-exon transcripts
-C: coding only: discard mRNAs that have no CDS features

--nc : non-coding only: discard mRNAs that have CDS features

--ignore-locus : discard locus features and attributes found in the input

-A: use the description field from <seq_info.fsize> and add it as the value for a 'descr' attribute to the GFF record
-s: <seq_info.fsize> is a tab-delimited file providing this info for each of the mapped sequences: <seq-name> <seq-length> <seq-description> (useful for -A option with mRNA/EST/protein mappings)
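
The difference between -r alone (overlap) and -r with -R (full containment) can be illustrated with a short Python sketch. It is not gffread's code: the regular expression is only one assumed reading of the [[<strand>]<chr>:]<start>..<end> syntax from the synopsis, and the names Region, parse_region and passes are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Region(NamedTuple):
    strand: Optional[str]
    chrom: Optional[str]
    start: int
    end: int


# Assumed reading of the -r argument: [[<strand>]<chr>:]<start>..<end>
_RANGE_RE = re.compile(r"^(?:([+-])?([^:]+):)?(\d+)\.\.(\d+)$")


def parse_region(spec: str) -> Region:
    """Split a range specification into strand, chromosome, start and end."""
    m = _RANGE_RE.match(spec)
    if m is None:
        raise ValueError(f"not a valid range: {spec!r}")
    strand, chrom, start, end = m.groups()
    return Region(strand, chrom, int(start), int(end))


def passes(region: Region, chrom: str, strand: str,
           t_start: int, t_end: int, contained: bool = False) -> bool:
    """Return True if a transcript would be shown for the given region.

    contained=False mimics -r (any overlap);
    contained=True mimics -r with -R (fully inside the range).
    """
    if region.chrom is not None and region.chrom != chrom:
        return False
    if region.strand is not None and region.strand != strand:
        return False
    if contained:
        return region.start <= t_start and t_end <= region.end
    return t_start <= region.end and t_end >= region.start
```

With the range +chr1:1000..2000, a transcript spanning 1500..2500 on the + strand of chr1 passes the overlap test but not the containment test.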

Sorting: (by default, chromosomes are kept in the order they were found)

--sort-alpha : chromosomes (reference sequences) are sorted alphabetically

--sort-by : sort the reference sequences by the order in which their names are given in the <refseq.lst> file
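
The three ordering modes described above can be sketched in Python as follows. This is an illustration only: sort_refs is a hypothetical helper, and placing names that are missing from the list file at the end is an assumption, since this page does not say how gffread treats them.

```python
def sort_refs(found_order, listed_names=None, alpha=False):
    """Order reference sequence names.

    found_order  -- names in the order they appeared in the input (default order)
    listed_names -- names read from a list file, as with --sort-by
    alpha        -- sort alphabetically, as with --sort-alpha
    """
    if listed_names is not None:
        rank = {name: i for i, name in enumerate(listed_names)}
        # Assumption: names absent from the list go last, in input order
        # (sorted() is stable, so ties keep their original order).
        return sorted(found_order, key=lambda n: rank.get(n, len(rank)))
    if alpha:
        return sorted(found_order)
    return list(found_order)
```

Note that a plain alphabetical sort places chr10 before chr2, which is a common reason to supply an explicit list instead.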

Misc options:

-F: attempt to preserve all GFF attributes

--keep-exon-attrs : for -F option, do not attempt to reduce redundant exon/CDS attributes

-G: do not keep exon attributes, move them to the transcript feature (for GFF3 output)

--keep-genes : in transcript-only mode (default), also preserve gene records

--keep-comments: for GFF3 input/output, try to preserve comments

-O: process other non-transcript GFF records (by default non-transcript records are ignored)
-V: discard any mRNAs with CDS having in-frame stop codons (requires -g)
-H: for -V option, check and adjust the starting CDS phase if the original phase leads to a translation with an in-frame stop codon
-B: for -V option, single-exon transcripts are also checked on the opposite strand (requires -g)
-P: add transcript level GFF attributes about the coding status of each transcript, including partialness or in-frame stop codons (requires -g)

--add-hasCDS : add a "hasCDS" attribute with value "true" for transcripts that have CDS features

--adj-stop : stop codon adjustment: enables -P and performs automatic adjustment of the CDS stop coordinate if premature or downstream

-N: discard multi-exon mRNAs that have any intron with a non-canonical splice site consensus (i.e. not GT-AG, GC-AG or AT-AC)
-J: discard any mRNAs that lack either an initial START codon or a terminal STOP codon, or that have an in-frame stop codon (i.e. only print mRNAs with a complete CDS)

--no-pseudo: filter out records matching the 'pseudo' keyword

--in-bed: input should be parsed as BED format (automatic if the input filename ends with .bed*)

--in-tlf: input GFF-like one-line-per-transcript format without exon/CDS features (see --tlf option below); automatic if the input filename ends with .tlf
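
The splice site test behind -N can be sketched as below. This is a minimal illustration, not gffread's implementation: it assumes each intron sequence is already given on the transcript strand, the function names are hypothetical, and treating introns shorter than 4 bases as non-canonical is an assumption of the sketch.

```python
# Donor/acceptor dinucleotide pairs listed for -N: GT-AG, GC-AG, AT-AC
CANONICAL = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}


def intron_is_canonical(intron_seq: str) -> bool:
    """Check the first two and last two bases of an intron sequence."""
    seq = intron_seq.upper()
    if len(seq) < 4:
        # Assumption of this sketch: too short to carry both dinucleotides.
        return False
    return (seq[:2], seq[-2:]) in CANONICAL


def keep_transcript(intron_seqs) -> bool:
    """Keep a transcript only if every intron is canonical.

    A single-exon transcript has no introns and is always kept,
    matching the "multi-exon" wording of -N.
    """
    return all(intron_is_canonical(s) for s in intron_seqs)
```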

Clustering:

-M/--merge : cluster the input transcripts into loci, discarding "duplicated" transcripts (those with the same exact introns and fully contained or equal boundaries)

-d <dupinfo> : for -M option, write duplication info to file <dupinfo>

--cluster-only: same as -M/--merge but without discarding any of the "duplicate" transcripts, only create "locus" features

-K: for -M option: also discard as redundant the shorter, fully contained transcripts (intron chains matching a part of the container)

-Q: for -M option, no longer require boundary containment when assessing redundancy (can be combined with -K); only introns have to match for multi-exon transcripts, and >=80% overlap for single-exon transcripts
-Y: for -M option, enforce -Q but also discard overlapping single-exon transcripts, even on the opposite strand (can be combined with -K)
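
The "duplicated" criterion of -M (same exact introns, boundaries fully contained or equal) can be illustrated with the following Python sketch for multi-exon transcripts. It is an assumption-laden reading of the text above, not gffread's algorithm: exons are (start, end) pairs with inclusive coordinates, both transcripts are assumed to lie on the same reference sequence and strand, and the relaxed rules of -K, -Q and -Y are not modelled.

```python
def introns(exons):
    """Derive intron intervals from (start, end) exon pairs (inclusive coordinates)."""
    exons = sorted(exons)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def is_duplicate(candidate, container):
    """True if candidate has the same intron chain as container and its
    outer boundaries are equal to or contained within those of container."""
    cand, cont = sorted(candidate), sorted(container)
    if introns(cand) != introns(cont):
        return False
    return cont[0][0] <= cand[0][0] and cand[-1][1] <= cont[-1][1]
```

The test is asymmetric: a shorter transcript can be a duplicate of a longer one with the same introns, but not the other way around unless the boundaries are equal.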

Output options:

--force-exons: make sure that the lowest level GFF features are considered "exon" features

--gene2exon: for single-line genes not parenting any transcripts, add an exon feature spanning the entire gene (treat it as a transcript)

-D: decode url encoded characters within attributes
-Z: merge very close exons into a single exon (when intron size<4)
-g: full path to a multi-fasta file with the genomic sequences for all input mappings, OR a directory with single-fasta files (one per genomic sequence, with file names matching sequence names)
-w: write a fasta file with spliced exons for each GFF transcript
-x: write a fasta file with spliced CDS for each GFF transcript
-y: write a protein fasta file with the translation of CDS for each record
-W: for -w and -x options, write in the FASTA defline the exon coordinates projected onto the spliced sequence; for -y option, write transcript attributes in the FASTA defline
-S: for -y option, use '*' instead of '.' as stop codon translation
-L: Ensembl GTF to GFF3 conversion (implies -F; should be used with -m)
-m: <chr_replace> is a name mapping table for converting reference sequence names, having this 2-column format: <original_ref_ID> <new_ref_ID>. WARNING: all GFF records on reference sequences whose original IDs are not found in the 1st column of this table will be discarded!
-t: use <trackname> in the 2nd column of each GFF/GTF output line
-o: print the GFF records that passed any given filters to <outfile.gff>. Use -o- to print to stdout
-T: for -o, output will be GTF instead of GFF3

--bed : for -o, output BED format instead of GFF3

--tlf : for -o, output "transcript line format", which is like GFF but with exons, CDS features and related data stored as GFF attributes in the transcript feature line, like this:
: exoncount=N;exons=<exons>;CDSphase=<N>;CDS=<CDScoords>
: <exons> is a comma-delimited list of exon_start-exon_end coordinates; <CDScoords> is CDS_start:CDS_end coordinates or a list like <exons>

-v,-E: expose (warn about) duplicate transcript IDs and other potential problems with the given GFF/GTF records
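
The attribute layout described for --tlf (exoncount, exons, CDSphase, CDS) can be sketched in Python as below. The helper tlf_attributes is hypothetical and only covers the CDS_start:CDS_end form of <CDScoords>; it shows how the attribute string is assembled, not how gffread writes the rest of the transcript line.

```python
def tlf_attributes(exons, cds=None, cds_phase=None):
    """Build the exon/CDS attribute string for one transcript.

    exons     -- list of (start, end) pairs
    cds       -- optional (CDS_start, CDS_end) pair
    cds_phase -- optional phase value for the CDSphase attribute
    """
    exons = sorted(exons)
    parts = [
        f"exoncount={len(exons)}",
        "exons=" + ",".join(f"{start}-{end}" for start, end in exons),
    ]
    if cds is not None:
        if cds_phase is not None:
            parts.append(f"CDSphase={cds_phase}")
        parts.append(f"CDS={cds[0]}:{cds[1]}")
    return ";".join(parts)
```

For two exons 100-200 and 300-400 with a CDS from 150 to 350 in phase 0, the sketch yields exoncount=2;exons=100-200,300-400;CDSphase=0;CDS=150:350.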

AUTHOR

This manpage was written by Andreas Tille for the Debian distribution and can be used for any other usage of the program.

June 2019

gffread 0.11.2

Source file:	gffread.1.en.gz (from gffread 0.12.1-4)
Source last updated:	2021-01-08T07:52:52Z