.\" Automatically generated by Pod::Man 4.07 (Pod::Simple 3.32) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is >0, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .if !\nF .nr F 0 .if \nF>0 \{\ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{\ . nr % 0 . nr F 2 . \} .\} .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . 
ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "BP_GENBANK2GFF3 1p" .TH BP_GENBANK2GFF3 1p "2017-01-15" "perl v5.24.1" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. 
.if n .ad l
.nh
.SH "NAME"
bp_genbank2gff3.pl \-\- Genbank\->gbrowse\-friendly GFF3
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
.Vb 1
\&  bp_genbank2gff3.pl [options] filename(s)
\&
\&  # process a directory containing GenBank flatfiles
\&  perl bp_genbank2gff3.pl \-\-dir path_to_files \-\-zip
\&
\&  # process a single file, ignore explicit exons and introns
\&  perl bp_genbank2gff3.pl \-\-filter exon \-\-filter intron file.gbk.gz
\&
\&  # process a list of files
\&  perl bp_genbank2gff3.pl *gbk.gz
\&
\&  # process data from URL, with Chado GFF model (\-noCDS), and pipe to database loader
\&  curl ftp://ftp.ncbi.nih.gov/genomes/Saccharomyces_cerevisiae/CHR_X/NC_001142.gbk \e
\&  | perl bp_genbank2gff3.pl \-noCDS \-in stdin \-out stdout \e
\&  | perl gmod_bulk_load_gff3.pl \-dbname mychado \-organism fromdata
\&
\&    Options:
\&        \-\-noinfer     \-r  don\*(Aqt infer exon/mRNA subfeatures
\&        \-\-conf        \-i  path to the curation configuration file that contains user preferences
\&                          for Genbank entries (must be YAML format)
\&                          (if \-\-manual is passed without \-\-ini, user will be prompted to
\&                           create the file if any manual input is saved)
\&        \-\-sofile      \-l  path to the so.obo file to use for feature type mapping
\&                          (\-\-sofile live will download the latest online revision)
\&        \-\-manual      \-m  when trying to guess the proper SO term, if more than
\&                          one option matches the primary tag, the converter will
\&                          wait for user input to choose the correct one
\&                          (only works with \-\-sofile)
\&        \-\-dir         \-d  path to a directory of genbank flatfiles
\&        \-\-outdir      \-o  location to write GFF files (can be \*(Aqstdout\*(Aq or \*(Aq\-\*(Aq for pipe)
\&        \-\-zip         \-z  compress GFF3 output files with gzip
\&        \-\-summary     \-s  print a summary of the features in each contig
\&        \-\-filter      \-x  genbank feature type(s) to ignore
\&        \-\-split       \-y  split output to separate GFF and fasta files for
\&                          each genbank record
\&        \-\-nolump      \-n  separate file for each reference sequence
\&                          (default is to lump all records together into one
\&                          output file for each input file)
\&        \-\-ethresh     \-e  error threshold for unflattener
\&                          set this high (>2) to ignore all unflattener errors
\&        \-\-[no]CDS     \-c  Keep CDS\-exons, or convert to alternate gene\-RNA\-protein\-exon
\&                          model. \-\-CDS is default. Use \-\-CDS to keep default GFF gene model,
\&                          use \-\-noCDS to convert to g\-r\-p\-e.
\&        \-\-format      \-f  Input format (SeqIO types): GenBank, Swiss or Uniprot, EMBL work
\&                          (GenBank is default)
\&        \-\-GFF_VERSION     3 is default, 2 and 2.5 and other Bio::Tools::GFF versions available
\&        \-\-quiet           don\*(Aqt talk about what is being processed
\&        \-\-typesource      SO sequence type for source (e.g. chromosome; region; contig)
\&        \-\-help        \-h  display this message
.Ve
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
This script uses Bio::SeqFeature::Tools::Unflattener and
Bio::Tools::GFF to convert GenBank flatfiles to \s-1GFF3\s0 with gene
containment hierarchies mapped for optimal display in gbrowse.
.PP
The input files are assumed to be gzipped GenBank flatfiles for RefSeq
contigs. The files may contain multiple GenBank records. Either a
single file or an entire directory can be processed. By default, the
\&\s-1DNA\s0 sequence is embedded in the \s-1GFF,\s0 but it can be saved into a
separate fasta file with the \-\-split (\-y) option.
.PP
If an input file contains multiple records, the default behaviour is
to dump all \s-1GFF\s0 and sequence to a file of the same name (with .gff
appended). Using the 'nolump' option will create a separate file for
each genbank record. Using the 'split' option will create separate
\&\s-1GFF\s0 and Fasta files for each genbank record.
.SS "Notes"
.IX Subsection "Notes"
\fI'split' and 'nolump' produce many files\fR
.IX Subsection "'split' and 'nolump' produce many files"
.PP
In cases where the input files contain many GenBank records (for
example, the chromosome files for the mouse genome build), a very
large number of output files will be produced if the 'split' or
\&'nolump' options are selected.
If you do have lists of more than 6000 files, use the
\-\-long_list option in bp_bulk_load_gff.pl or
bp_fast_load_gff.pl to load the gff and/or fasta files.
.PP
\fIDesigned for RefSeq\fR
.IX Subsection "Designed for RefSeq"
.PP
This script is designed for RefSeq genomic sequence entries. It may
work for third party annotations but this has not been tested.
However, as noted below, Uniprot/Swissprot input works, as does \s-1EMBL\s0 and
possibly EMBL/Ensembl, if you don't mind some gene model unflattener
errors (dgg).
.PP
\fIG\-R-P-E Gene Model\fR
.IX Subsection "G-R-P-E Gene Model"
.PP
Don Gilbert worked this over to produce \s-1GFF3\s0 suited to
loading into \s-1GMOD\s0 Chado databases. Most of the changes, I believe, are
suited for general use. One main chado-specific addition is the
\-\-[no]cds2protein flag.
.PP
My preference is to set the above \s-1ON\s0 by default (disable with
\-\-nocds2prot). For general use it probably should be \s-1OFF,\s0 enabled
with \-\-cds2prot.
.PP
This writes \s-1GFF\s0 with an alternate, but useful, gene model,
instead of the consensus model for \s-1GFF3:\s0
.PP
.Vb 1
\&  [ gene > mRNA > (exon,CDS,UTR) ]
.Ve
.PP
The alternate model is
.PP
.Vb 1
\&  gene > mRNA > polypeptide > exon
.Ve
.PP
In this model the only feature with \s-1DNA\s0 bases is the exon. The others
specify only location ranges on a genome. Exon is a child of both mRNA
and protein/polypeptide.
.PP
The protein/polypeptide feature is an important one, having all the
annotations of the GenBank \s-1CDS\s0 feature: protein \s-1ID,\s0 translation, \s-1GO\s0
terms, and Dbxrefs to other proteins.
.PP
UTRs, introns, and CDS-exons are all inferred from the primary exon bases
inside/outside appropriate higher feature ranges. Other special gene
model features remain the same.
.PP
Several other improvements and bugfixes, minor but useful, are included:
.PP
.Vb 2
\&  * IO pipes now work:
\&    curl ftp://ncbigenomes/... | bp_genbank2gff3 \-\-in stdin \-\-out stdout | gff2chado ...
\&
\&  * GenBank main record fields are added to source feature, e.g. organism, date,
\&    and the sourcetype, commonly chromosome for genomes, is used.
\&
\&  * Gene model handling for ncRNA and pseudogenes is added.
\&
\&  * GFF header is cleaner, more informative.
\&    \-\-GFF_VERSION flag allows choice of v2 as well as default v3
\&
\&  * GFF ##FASTA inclusion is improved, and
\&    CDS translation sequence is moved to FASTA records.
\&
\&  * FT \-> GFF attribute mapping is improved.
\&
\&  * \-\-format choice of SeqIO input formats (GenBank default).
\&    Uniprot/Swissprot and EMBL work and produce useful GFF.
\&
\&  * SeqFeature::Tools::TypeMapper has a few FT \-> SOFA additions
\&    and more flexible usage.
.Ve
.SH "TODO"
.IX Header "TODO"
.SS "Are these additions desired?"
.IX Subsection "Are these additions desired?"
.Vb 3
\&  * filter input records by taxon (e.g. keep only organism=xxx or taxa level = classYYY)
\&  * handle Entrezgene, other non\-sequence SeqIO structures (really should change
\&    those parsers to produce consistent annotation tags).
.Ve
.SS "Related bugfixes/tests"
.IX Subsection "Related bugfixes/tests"
These items from Bioperl mail were tested (using the sample data that
generated the errors) and found to be corrected:
.PP
.Vb 4
\&  From: Ed Green (eva.mpg.de)
\&  Subject: genbank2gff3.pl on new human RefSeq
\&  Date: 2006\-03\-13 21:22:26 GMT
\&    \-\- unspecified errors (sample data works now).
\&
\&  From: Eric Just (northwestern.edu)
\&  Subject: genbank2gff3.pl
\&  Date: 2007\-01\-26 17:08:49 GMT
\&    \-\- bug fixed in genbank2gff3 for multi\-record handling
.Ve
.PP
This error is for a /trans_splice gene that is hard to handle, and
the unflattener/genbank2gff3 does not handle it:
.PP
.Vb 3
\&  From: Chad Matsalla (dieselwurks.com)
\&  Subject: genbank2gff3.PLS and the unflatenner \- Inconsistent order?
\&  Date: 2005\-07\-15 19:51:48 GMT
.Ve
.SH "AUTHOR"
.IX Header "AUTHOR"
Sheldon McKay (mckays@cshl.edu)
.PP
Copyright (c) 2004 Cold Spring Harbor Laboratory.
.SS "\s-1AUTHOR\s0 of hacks for GFF2Chado loading"
.IX Subsection "AUTHOR of hacks for GFF2Chado loading"
Don Gilbert (gilbertd@indiana.edu)