.\" Man page generated from reStructuredText.
.
.TH VARIANTCALLER 1 "February 2016" "" ""
.SH NAME
variantCaller \- variant-calling algorithms for PacBio sequencing data
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.SH SYNOPSIS
.sp
\fBvariantCaller.py\fP is invoked from the command line. For example, a simple invocation is:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
variantCaller.py \-j8 \-\-algorithm=quiver \e
    \-r lambdaNEB.fa \e
    \-o variants.gff \e
    aligned_reads.cmp.h5
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
which requests that variant calling proceed
.INDENT 0.0
.IP \(bu 2
using 8 worker processes,
.IP \(bu 2
employing the \fBquiver\fP algorithm,
.IP \(bu 2
taking input from the file \fBaligned_reads.cmp.h5\fP,
.IP \(bu 2
using the FASTA file \fBlambdaNEB.fa\fP as the reference,
.IP \(bu 2
and writing output to \fBvariants.gff\fP (see \fBpbgff\fP(5)).
.UNINDENT
.sp
A particularly useful option is \fB\-\-referenceWindow/\-w\fP: this option allows the user to direct the tool to perform variant calling exclusively on a \fIwindow\fP of the reference genome, where the
.SH OPTIONS
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
variantCaller.py \-\-help
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
will provide a help message explaining all available options.
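.sp
As a hedged illustration of the invocation shown in the SYNOPSIS, the following Python sketch assembles an argument list using only the flags that appear on this page (\fB\-j\fP, \fB\-\-algorithm\fP, \fB\-r\fP, \fB\-o\fP). The helper name \fBbuild_variant_caller_argv\fP and its parameters are hypothetical and are not part of \fBvariantCaller.py\fP. Capping the worker count at the number of CPUs follows the recommendation given under \fIPerformance Requirements\fP.

```python
import os


def build_variant_caller_argv(aligned_reads, reference,
                              output="variants.gff",
                              algorithm="quiver", workers=None):
    """Assemble an argv list for variantCaller.py.

    Hypothetical helper for illustration only; it uses just the flags
    documented on this page and does not run the tool.
    """
    # The page documents exactly two algorithms.
    if algorithm not in ("quiver", "plurality"):
        raise ValueError("unknown algorithm: %r" % (algorithm,))

    # Use no more worker processes than there are CPUs on the host.
    cpus = os.cpu_count() or 1
    if workers is None:
        workers = cpus
    workers = max(1, min(int(workers), cpus))

    return [
        "variantCaller.py",
        "-j%d" % workers,
        "--algorithm=%s" % algorithm,
        "-r", reference,
        "-o", output,
        aligned_reads,
    ]
```

.sp
The returned list could then be handed to \fBsubprocess.run\fP from the Python standard library. Whether the tool is on the search path under the name \fBvariantCaller.py\fP is an assumption about the local installation.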
.SH NOTES
.SS Input and output
.sp
\fBvariantCaller.py\fP requires two input files:
.INDENT 0.0
.IP \(bu 2
A file of reference\-aligned reads in PacBio\(aqs standard cmp.h5 format;
.IP \(bu 2
A FASTA file that has been processed by ReferenceUploader.
.UNINDENT
.sp
The tool\(aqs output is written in the GFF format, as described in \fBpbgff\fP(5). External tools can be used to convert the GFF file to a VCF or BED file, two other standard interchange formats for variant calling.
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
\fBInput cmp.h5 file requirements\fP
.sp
\fBvariantCaller.py\fP requires its input cmp.h5 file to be sorted. An unsorted file can be sorted using the tool \fBcmpH5Sort.py\fP\&.
.sp
The \fBquiver\fP(1) algorithm in \fBvariantCaller\fP requires its input cmp.h5 file to have the following \fIpulse features\fP:
.INDENT 0.0
.IP \(bu 2
\fBInsQV\fP,
.IP \(bu 2
\fBSubsQV\fP,
.IP \(bu 2
\fBDelQV\fP,
.IP \(bu 2
\fBDelTag\fP,
.IP \(bu 2
\fBMergeQV\fP\&.
.UNINDENT
.sp
The \fBplurality\fP(1) algorithm can be run on cmp.h5 files that lack these features.
.UNINDENT
.UNINDENT
.sp
The input file is the main argument to \fBvariantCaller.py\fP, while the output file is provided as an argument to the \fB\-o\fP flag. For example,
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
variantCaller.py aligned_reads.cmp.h5 \-r lambda.fa \-o variants.gff
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
will read input from \fBaligned_reads.cmp.h5\fP, using the reference \fBlambda.fa\fP, and send output to the file \fBvariants.gff\fP\&. The extension of the filename provided to the \fB\-o\fP flag is meaningful, as it determines the output file format.
The file formats presently supported, by extension, are
.INDENT 0.0
.TP
.B \fB\&.gff\fP
GFFv3 format
.TP
.B \fB\&.txt\fP
a simplified human\-readable format used primarily by the developers
.UNINDENT
.sp
If the \fB\-o\fP flag is not provided, the default behavior is to write to the file \fBvariants.gff\fP in the current directory.
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
\fBvariantCaller.py\fP does \fBnot\fP modify its input cmp.h5 file in any way. This is in contrast to previous variant callers in use at PacBio, which would write a \fIconsensus\fP dataset to the input cmp.h5 file.
.UNINDENT
.UNINDENT
.SS Available algorithms
.sp
At this time there are two algorithms available for variant calling: \fBplurality\fP and \fBquiver\fP\&.
.sp
\fBPlurality\fP is a simple and very fast procedure that merely tallies the most frequent read base or bases found in alignment with each reference base, and reports deviations from the reference as potential variants.
.sp
\fBQuiver\fP is a more complex procedure based on algorithms originally developed for CCS. Quiver leverages the quality values (QVs) provided by upstream processing tools, which indicate whether an insertion, deletion, or substitution was deemed likely at a given read position. Use of \fBquiver\fP requires the \fBConsensusCore\fP library as well as a trained parameter set, which will be loaded from a standard location (TBD). Quiver can be thought of as a QV\-aware local\-realignment procedure.
.sp
Both algorithms are expected to converge to \fIzero\fP errors (miscalled variants) as coverage increases; however, \fBquiver\fP should converge much faster (i.e., fewer errors at low coverage), and should provide greater variant detection power at a given error level.
.SS Confidence values
.sp
Both \fIquiver\fP and \fIplurality\fP make a confidence metric available for every position of the consensus sequence.
The confidence should be interpreted as a phred\-transformed posterior probability that the consensus call is incorrect; i.e.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
QV = \-10 \elog_{10}(p_{err})
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
\fBvariantCaller.py\fP clips reported QV values at 93, because larger values cannot be encoded in a standard FASTQ file.
.SS Chemistry specificity
.sp
The Quiver algorithm parameters are trained per\-chemistry. SMRTanalysis software loads metadata into the \fIcmp.h5\fP to indicate the chemistry used per movie. Quiver reads this metadata and automatically chooses the appropriate parameter set to use. This selection can be overridden by a command\-line flag.
.sp
When multiple chemistries are represented in the reads in a \fIcmp.h5\fP, Quiver will model each read appropriately using the parameter set for its chemistry, thus yielding optimal results.
.SS Performance Requirements
.sp
\fBvariantCaller.py\fP performs variant calling in parallel using multiple processes. Work splitting and inter\-process communication are handled using the Python \fBmultiprocessing\fP module. Work can be split among an arbitrary number of processes (using the \fB\-j\fP command\-line flag), but for best performance one should use no more worker processes than there are CPUs in the host computer.
.sp
The running time of the \fIplurality\fP algorithm should not exceed the runtime of the BLASR process that produced the cmp.h5. The running time of the \fIquiver\fP algorithm should not exceed 4x the runtime of BLASR.
.sp
The amount of core memory (RAM) used among all the Python processes launched by a \fBvariantCaller.py\fP run should not exceed the size of the uncompressed input \fB\&.cmp.h5\fP file.
.SH SEE ALSO
.sp
\fBquiver\fP(1), \fBplurality\fP(1), \fBpbgff\fP(5), \fBblasr\fP(1)
.\" Generated by docutils manpage writer.
.
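.SH EXAMPLES
.sp
The phred transform described under \fIConfidence values\fP can be sketched in a few lines of Python. This is only an illustration of the formula and of the clip at 93 stated on this page. The function names are hypothetical, and returning an unrounded floating\-point value is an assumption, since this page does not say how \fBvariantCaller.py\fP rounds its reported QVs.

```python
import math

MAX_QV = 93.0  # clip value stated on this page (FASTQ encoding limit)


def error_probability_to_qv(p_err):
    """Phred-transform an error probability: QV = -10 * log10(p_err).

    Hypothetical helper; the result is clipped at MAX_QV and is not
    rounded (rounding behaviour is not specified on this page).
    """
    if not 0.0 <= p_err <= 1.0:
        raise ValueError("p_err must be a probability in [0, 1]")
    if p_err == 0.0:
        # log10(0) is undefined; a zero error probability maps to the cap.
        return MAX_QV
    return min(MAX_QV, -10.0 * math.log10(p_err))


def qv_to_error_probability(qv):
    """Invert the transform: p_err = 10 ** (-QV / 10)."""
    return 10.0 ** (-qv / 10.0)
```

.sp
For instance, an error probability of 0.001 corresponds to a QV of 30, and any error probability small enough to give a QV above 93 is reported as 93.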