.\"t
.\" Automatically generated by Pandoc 2.2.1
.\"
.TH "PBGFF" "5" "August 2014" "2.1" "Bioinformatics formats"
.hy
.SH NAME
.PP
\f[B]pbgff\f[] \- Pacific Biosciences extended GFFv3 file format
.SH DESCRIPTION
.PP
As of this version, \f[C]variants.gff\f[] is our primary variant call
file format.
The \f[C]variants.gff\f[] file is based on the GFFv3 standard
(http://www.sequenceontology.org/gff3.shtml).
The GFFv3 standard describes a tab\-delimited plain\-text file
meta\-format for describing genomic \[lq]features.\[rq]
Each GFF file consists of initial \[lq]header\[rq] lines supplying
metadata, followed by \[lq]feature\[rq] lines, one per identified
variant.
.SS The GFF Coordinate System
.PP
All coordinates in GFF files are 1\-based, and all intervals
\f[C]start,\ end\f[] are understood as including both endpoints.
.SS Headers
.PP
The \f[C]variants.gff\f[] file begins with a block of metadata headers,
which looks like the following:
.IP
.nf
\f[C]
##gff\-version\ 3
##pacbio\-variant\-version\ 2.1
##date\ Tue\ Feb\ 28\ 17:44:18\ 2012
##feature\-ontology\ http://song.cvs.sourceforge.net/*checkout*/song/ontology/sofa.obo?revision=1.12
##source\ GenomicConsensus\ v0.1.0
##source\-commandline\ callVariants.py\ \-\-algorithm=plurality\ aligned_reads.cmp.h5\ \-r\ spinach.fasta\ \-o\ variants.gff
##source\-alignment\-file\ /home/popeye/data/aligned_reads.cmp.h5
##source\-reference\-file\ /home/popeye/data/spinach.fasta
##sequence\-region\ EGFR_Exon_23\ 1\ 189
##sequence\-region\ EGFR_Exon_24\ 1\ 200
\f[]
.fi
.PP
The \f[C]source\f[] header gives the name and version of the software
that generated the file, and \f[C]source\-commandline\f[] records the
command line it was invoked with.
\f[C]pacbio\-variant\-version\f[] gives the version of this
specification that the file contents adhere to.
.PP
The \f[C]sequence\-region\f[] headers describe the names and extents of
the reference groups (i.e.\ reference contigs) that are referred to in
the file.
Each name is identical to the full FASTA header of the corresponding
contig.
.PP
\f[C]source\-alignment\-file\f[] and \f[C]source\-reference\-file\f[]
record absolute paths to the primary input files.
.SS Feature lines
.PP
After the headers, each line in the file describes a genomic
\f[I]feature\f[]; in this file, all the features are potential variants
flagged by the variant caller.
A variant line is a 9\-column (tab\-delimited) record, where the first 8
columns correspond to fixed, predefined entities in the GFF standard,
while the 9th column is a flexible semicolon\-delimited list of
\f[C]key=value\f[] mappings.
.PP
The 8 predefined columns are as follows:
.PP
.TS
tab(@);
lw(7.7n) lw(8.6n) lw(32.6n) lw(19.2n).
T{
Column Number
T}@T{
Name
T}@T{
Description
T}@T{
Example
T}
_
T{
1
T}@T{
seqId
T}@T{
The full FASTA header for the reference contig.
T}@T{
\f[C]lambda_NEB3011\f[]
T}
T{
2
T}@T{
source
T}@T{
Unused; always populated with \f[C]\&.\f[]
T}@T{
\f[C]\&.\f[]
T}
T{
3
T}@T{
type
T}@T{
The type of variant: one of \f[C]insertion\f[], \f[C]deletion\f[], or
\f[C]substitution\f[].
T}@T{
\f[C]substitution\f[]
T}
T{
4
T}@T{
start
T}@T{
1\-based start coordinate for the variant.
T}@T{
200
T}
T{
5
T}@T{
end
T}@T{
1\-based end coordinate for the variant.
start<=end always holds, regardless of strand.
T}@T{
215
T}
T{
6
T}@T{
score
T}@T{
Unused; always populated with \f[C]\&.\f[]
T}@T{
\f[C]\&.\f[]
T}
T{
7
T}@T{
strand
T}@T{
Unused; always populated with \f[C]\&.\f[]
T}@T{
\f[C]\&.\f[]
T}
T{
8
T}@T{
phase
T}@T{
Unused; always populated with \f[C]\&.\f[]
T}@T{
\f[C]\&.\f[]
T}
.TE
.PP
The attributes in the 9th (final) column are as follows:
.PP
.TS
tab(@);
lw(17.5n) lw(31.1n) lw(20.4n).
T{
Key
T}@T{
Description
T}@T{
Example value
T}
_
T{
\f[C]coverage\f[]
T}@T{
the read coverage of the variant site (not the variant itself)
T}@T{
\f[C]42\f[]
T}
T{
\f[C]confidence\f[]
T}@T{
the Phred\-scaled confidence that the variant is real, rounded to the
nearest integer and capped at 93
T}@T{
\f[C]37\f[]
T}
T{
\f[C]reference\f[]
T}@T{
the reference base or bases for the variant site.
May be \f[C]\&.\f[] to represent a zero\-length substring (for insertion
events)
T}@T{
\f[C]T\f[], \f[C]\&.\f[]
T}
T{
\f[C]variantSeq\f[]
T}@T{
the read base or bases corresponding to the variant.
\f[C]\&.\f[] encodes a zero\-length string, as for a deletion.
T}@T{
.TP
.B \f[C]T\f[] (haploid);
.RS
.RE
.TP
.B \f[C]T/C\f[], \f[C]T/.\f[] (heterozygous)
.RS
.RE
T}
T{
\f[C]frequency\f[]
T}@T{
the number of reads supporting the variant itself; for heterozygous
variants, the read count for each of the two observed alleles,
separated by \f[C]/\f[].
This field is optional.
T}@T{
.TP
.B \f[C]13\f[] (haploid)
.RS
.RE
.TP
.B \f[C]15/12\f[] (heterozygous)
.RS
.RE
T}
.TE
.PP
The attributes may be present in any order.
.PP
The three types of variant we support are as follows.
\f[I](In an actual file the fields are separated by tabs, not by the
spaces shown here.)\f[]
.IP "1." 3
Insertion.
Examples:
.RS 4
.IP
.nf
\f[C]
ref00001\ .\ insertion\ 8\ 8\ .\ .\ .\ reference=.;variantSeq=G;confidence=22;coverage=18;frequency=10
ref00001\ .\ insertion\ 19\ 19\ .\ .\ .\ reference=.;variantSeq=G/.;confidence=22;coverage=18;frequency=7/5
\f[]
.fi
.RE
.RS
.PP
For insertions, start==end, and the insertion event is understood as
taking place \f[I]following\f[] the reference position start.
.RE
.IP "2." 3
Deletion.
Examples:
.RS 4
.IP
.nf
\f[C]
ref00001\ .\ deletion\ 348\ 349\ .\ .\ .\ reference=G;variantSeq=.;confidence=39;coverage=25;frequency=20
ref00001\ .\ deletion\ 441\ 443\ .\ .\ .\ reference=GG;variantSeq=GG/.;confidence=39;coverage=25;frequency=8/8
\f[]
.fi
.RE
.IP "3." 3
Substitution.
Examples:
.RS 4
.IP
.nf
\f[C]
ref000001\ .\ substitution\ 100\ 102\ .\ .\ .\ reference=GGG;variantSeq=CCC;confidence=50;coverage=20;frequency=16
ref000001\ .\ substitution\ 200\ 201\ .\ .\ .\ reference=G;variantSeq=G/C;confidence=50;coverage=20;frequency=10/6
\f[]
.fi
.RE
.SS Compression
.PP
The GFF metaformat is verbose, so for practical purposes we
gzip\-compress \f[C]variants.gff\f[] files as
\f[C]variants.gff.gz\f[].
Consumers of the variant file should be able to read it in either form.
.SH SEE ALSO
.PP
The VCF and BED standards describe variant\-call specific file formats.
We can currently translate \f[C]variants.gff\f[] files to these formats,
but they are not the primary output of the variant callers.
.SH AUTHORS
Pacific Biosciences.
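.SH EXAMPLES
.PP
The feature\-line layout described under DESCRIPTION lends itself to a
short parser.
The following Python sketch is illustrative only and is not part of this
specification or of GenomicConsensus.
The names \f[C]parse_attributes\f[], \f[C]split_alleles\f[],
\f[C]parse_feature_line\f[], and \f[C]read_variants\f[] are
hypothetical.
The code assumes that every feature line carries the
\f[C]coverage\f[], \f[C]confidence\f[], \f[C]reference\f[], and
\f[C]variantSeq\f[] attributes, and that a file name ending in
\f[C]\&.gz\f[] marks a gzip\-compressed file.
.IP
.nf
\f[C]
```python
import gzip

TAB = chr(9)  # tab character; columns are tab-delimited


def parse_attributes(column9):
    """Turn the 9th column (key=value pairs joined by ";") into a dict."""
    attributes = {}
    for item in column9.strip().split(";"):
        if item:
            key, _, value = item.partition("=")
            attributes[key] = value
    return attributes


def split_alleles(value):
    """Split "T/C" into ["T", "C"]; "." stands for a zero-length string."""
    return ["" if allele == "." else allele for allele in value.split("/")]


def parse_feature_line(line):
    """Parse one 9-column feature line into a dict (illustrative helper)."""
    fields = line.rstrip().split(TAB)
    if len(fields) != 9:
        raise ValueError("expected 9 tab-delimited columns, got %d" % len(fields))
    attributes = parse_attributes(fields[8])
    record = {
        "seqId": fields[0],
        "type": fields[2],
        "start": int(fields[3]),  # 1-based, inclusive
        "end": int(fields[4]),  # 1-based, inclusive
        # Assumption: these four attributes are always present.
        "coverage": int(attributes["coverage"]),
        "confidence": int(attributes["confidence"]),
        "reference": split_alleles(attributes["reference"])[0],
        "variantSeq": split_alleles(attributes["variantSeq"]),
    }
    # frequency is optional; when present it holds one count per allele.
    if "frequency" in attributes:
        record["frequency"] = [int(n) for n in attributes["frequency"].split("/")]
    return record


def read_variants(path):
    """Return (headers, records) for a plain or gzip-compressed file."""
    # Assumption: the ".gz" suffix is what distinguishes the two forms.
    opener = gzip.open if path.endswith(".gz") else open
    headers, records = [], []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                headers.append(line[2:].strip())
            elif line.strip() and not line.startswith("#"):
                records.append(parse_feature_line(line))
    return headers, records
```
\f[]
.fi
.PP
Under these assumptions, calling
\f[C]read_variants("variants.gff.gz")\f[] would return the header lines
(without the leading \f[C]##\f[]) together with one dictionary per
variant.
Heterozygous calls appear as two\-element \f[C]variantSeq\f[] and
\f[C]frequency\f[] lists, and a \f[C]\&.\f[] allele is returned as an
empty string.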