.\"t
.\" Automatically generated by Pandoc 2.2.1
.\"
.TH "GENOMICCONSENSUS" "7" "" "" ""
.hy
.SH What are EviCons? GenomicConsensus? Quiver? Plurality?
.PP
\f[B]GenomicConsensus\f[] is the current PacBio consensus and variant
calling suite.
It contains a main driver program, \f[C]variantCaller\f[], which
provides two consensus/variant calling algorithms, \f[B]Arrow\f[] and
\f[B]Quiver\f[], as well as the simpler \f[B]Plurality\f[] algorithm.
These algorithms can be run by calling
\f[C]variantCaller\ \-\-algorithm=[arrow|quiver|plurality]\f[] or by
going through the convenience wrapper scripts \f[C]quiver\f[] and
\f[C]arrow\f[].
.PP
\f[B]EviCons\f[] was the previous\-generation PacBio variant caller
(removed in software release v1.3.1).
.PP
Separate packages called \f[B]ConsensusCore\f[] and
\f[B]ConsensusCore2\f[] are C++ libraries where all the computation
behind Quiver and Arrow, respectively, is done.
This is transparent to the user after installation.
.SH What is Plurality?
.PP
\f[B]Plurality\f[] is a very simple variant calling algorithm: it stacks
up the aligned reads (aligned by BLASR or an alternative mapping tool)
and, for each column under a reference base, calls the most abundant
(i.e., the plurality) read base (or bases, or deletion) as the consensus
at that reference position.
.SH Why is Plurality a weak algorithm?
.PP
Plurality does not perform any local realignment.
This means it is heavily biased by the alignment produced by the mapper
(BLASR, typically).
It also means that it is insensitive to indels.
Consider this example:
.IP
.nf
\f[C]
Reference\ \ \ \ AAAA
\ \ \ \ \ \ \ \ \ \ \ \ \ \-\-\-\-
\ \ Aligned\ \ \ \ A\-AA
\ \ \ \ reads\ \ \ \ AA\-A
\ \ \ \ \ \ \ \ \ \ \ \ \ \-AAA
\ \ \ \ \ \ \ \ \ \ \ \ \ \-\-\-\-
Plurality\ \ \ \ AAAA
consensus
\f[]
.fi
.PP
Note here that every read has a deletion and the correct consensus call
would be \[lq]AAA\[rq], but due to the mapper's freedom in
gap\-placement at the single\-read level, the plurality sequence is
\[lq]AAAA\[rq]\[em]so the deletion is missed.
Local realignment, which plurality does not do, but which could be
considered as implicit in the Quiver algorithm, essentially pushes the
gaps here to the same column, thus identifying the deletion.
While plurality could be adjusted to use a simple \[lq]gap
normalizing\[rq] realignment, in practice noncognate extras (spurious
non\-homopolymer base calls) in the midst of homopolymer runs pose
challenges.
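.PP
The example above can be reproduced with a minimal Python sketch.
It is an illustration only, not the GenomicConsensus implementation: the
helper names \f[C]plurality\f[] and \f[C]left_normalize\f[] are
hypothetical, and the gap normalization shown is the simple kind
mentioned above, which only copes with clean homopolymer runs.
.IP
.nf
\f[C]
```python
from collections import Counter


def plurality(reads):
    """Call the most common symbol in each alignment column."""
    columns = zip(*reads)
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)


def left_normalize(ref, read):
    """Shift each gap left while the alignment stays equivalent."""
    chars = list(read)
    for i in range(len(chars)):
        j = i
        # A gap may swap with the base on its left when that base still
        # matches the reference base in the gap column.
        while chars[j] == "-" and j > 0 and chars[j - 1] == ref[j]:
            chars[j - 1], chars[j] = chars[j], chars[j - 1]
            j -= 1
    return "".join(chars)


ref = "AAAA"
reads = ["A-AA", "AA-A", "-AAA"]

# Column-wise vote on the alignments exactly as the mapper produced them.
raw = plurality(reads).replace("-", "")

# The same vote after pushing every gap into the leftmost column.
shifted = [left_normalize(ref, r) for r in reads]
normalized = plurality(shifted).replace("-", "")

print(raw, normalized)
```
\f[]
.fi
.PP
With the three reads shown, the column\-wise vote yields
\[lq]AAAA\[rq], while voting after gap normalization yields
\[lq]AAA\[rq], recovering the deletion.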
.SH What is Quiver?
.PP
\f[B]Quiver\f[] is a more sophisticated algorithm that finds the maximum
quasi\-likelihood template sequence given PacBio reads of the template.
PacBio reads are modeled using a conditional random field approach that
scores the quasi\-likelihood of a read given a template sequence.
In addition to the base sequence of each read, Quiver uses several
additional \f[I]QV\f[] covariates that the basecaller provides.
Using these covariates provides additional information about each read,
allowing more accurate consensus calls.
.PP
Quiver does not use the alignment provided by the mapper (BLASR,
typically), except for determining how to group reads together at a
macro level.
It implicitly performs its own realignment, so it is highly sensitive to
all variant types, including indels\[em]for example, it resolves the
example above with ease.
.PP
The name \f[B]Quiver\f[] reflects a consensus\-calling algorithm that is
QV\-aware.
.PP
We use the lowercase \[lq]quiver\[rq] to denote the quiver \f[I]tool\f[]
in GenomicConsensus, which applies the Quiver algorithm to mapped reads
to derive sequence consensus and variants.
.PP
Quiver is described in detail in the supplementary material to the HGAP
paper (http://www.nature.com/nmeth/journal/v10/n6/full/nmeth.2474.html).
.SH What is Arrow?
.PP
Arrow is a newer model intended to supersede Quiver in the near future.
The key differences from Quiver are that it uses a hidden Markov model
(HMM) instead of a conditional random field (CRF), it computes true
likelihoods, and it uses a smaller set of covariates.
We expect a whitepaper on Arrow to be available soon.
.PP
We use the lowercase \[lq]arrow\[rq] to denote the arrow \f[I]tool\f[],
which applies the Arrow algorithm to mapped reads to derive sequence
consensus and variants.
.SH How do I run quiver/arrow?
.PP
For general instructions on installing and running, see the
HowTo (./HowTo.rst) document.
.SH What is the output from quiver/arrow?
.PP
There are three output files from the GenomicConsensus tools:
.IP "1." 3
A consensus \f[I]FASTA\f[] file containing the consensus sequence
.IP "2." 3
A consensus \f[I]FASTQ\f[] file containing the consensus sequence with
quality annotations
.IP "3." 3
A variants \f[I]GFF\f[] file containing a filtered, annotated list of
variants identified
.PP
It is important to note that the variants included in the output
variants GFF file are \f[I]filtered\f[] by coverage and quality, so not
all variants that are apparent in comparing the reference to the
consensus FASTA output will correspond to variants in the output
variants GFF file.
.PP
To enable all output files, the following can be run (for example):
.IP
.nf
\f[C]
%\ quiver\ \-j16\ aligned_reads.cmp.h5\ \-r\ ref.fa\ \\
\ \-o\ consensus.fa\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \\
\ \-o\ consensus.fq\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \\
\ \-o\ variants.gff
\f[]
.fi
.PP
The extension is used to determine the output file format.
.SH What does it mean that quiver consensus is \f[I]de novo\f[]?
.PP
Quiver's consensus is \f[I]de novo\f[] in the sense that the reference
and the reference alignment are not used to inform the consensus output.
Only the reads factor into the determination of the consensus.
.PP
The reference sequence is used to make consensus calls only when there
is no effective coverage in a genomic window, so that Quiver has no
evidence for computing consensus, and only if the
\f[C]\-\-noEvidenceConsensusCall\f[] flag is set to \f[C]reference\f[]
or \f[C]lowercasereference\f[] (the default).
One can set \f[C]\-\-noEvidenceConsensusCall=nocall\f[] to avoid using
the reference even in zero coverage regions.
.SH What is the expected quiver accuracy?
.PP
Quiver's expected accuracy is a function of coverage and chemistry.
The C2 (no longer available), P6\-C4 and P4\-C2 chemistries provide the
highest accuracy.
Nominal consensus accuracy levels are as follows:
.PP
.TS
tab(@);
lw(11.7n) lw(26.0n) lw(20.0n).
T{
Coverage
T}@T{
Expected accuracy (C2, P4\-C2, P6\-C4)
T}@T{
Expected accuracy (P5\-C3)
T}
_
T{
10x
T}@T{
> Q30
T}@T{
> Q30
T}
T{
20x
T}@T{
> Q40
T}@T{
> Q40
T}
T{
40x
T}@T{
> Q50
T}@T{
> Q45
T}
T{
60\-80x
T}@T{
~ Q60
T}@T{
> Q55
T}
.TE
.PP
The \[lq]Q\[rq] values referred to are Phred\-scaled quality values:
.PP
.RS
\f[I]Q\f[] = \-10 log10(\f[I]p_error\f[])
.RE
.PP
for instance, Q50 corresponds to a p_error of 0.00001\[em]an accuracy of
99.999%.
These accuracy expectations are based on routine validations performed
on multiple bacterial genomes before each chemistry release.
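.PP
The conversion between Q values and error probabilities can be sketched
in a few lines of Python.
This is a worked illustration of the Phred formula above, not code taken
from GenomicConsensus; the function names are hypothetical.
.IP
.nf
\f[C]
```python
import math


def phred_to_error(q):
    """Error probability for a Phred-scaled quality value."""
    return 10 ** (-q / 10)


def error_to_phred(p_error):
    """Phred-scaled quality value for an error probability."""
    return -10 * math.log10(p_error)


def expected_errors(q, length):
    """Expected number of errors in a region of the given length."""
    return length * phred_to_error(q)


p50 = phred_to_error(50)
accuracy50 = 1 - p50
```
\f[]
.fi
.PP
Here \f[C]p50\f[] is the per\-base error probability at Q50 and
\f[C]accuracy50\f[] is the matching accuracy;
\f[C]expected_errors\f[] scales the same probability to a region
length.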
.SH What is the expected accuracy from arrow?
.PP
arrow achieves similar accuracy to quiver.
Numbers will be published soon.
.SH What are the residual errors after applying quiver?
.PP
If there are errors remaining after applying Quiver, they will almost
invariably be homopolymer run\-length errors (insertions or deletions).
.SH Does quiver/arrow need to know what sequencing chemistry was used?
.PP
At present, the Quiver model is trained per\-chemistry, so it is very
important that Quiver knows the sequencing chemistries used.
.PP
If SMRT Analysis software was used to build the cmp.h5 or BAM input
file, the file will be loaded with information about the sequencing
chemistry used for each SMRT Cell, and GenomicConsensus will
automatically identify the right parameters to use.
.PP
If custom software was used to build the cmp.h5, or an override of
Quiver's autodetection is desired, then the chemistry or model must be
explicitly entered.
For example:
.IP
.nf
\f[C]
%\ quiver\ \-p\ P4\-C2\ ...
%\ quiver\ \-p\ P4\-C2.AllQVsMergingByChannelModel\ ...
\f[]
.fi
.SH Can a mix of chemistries be used in a cmp.h5 file for quiver/arrow?
.PP
Yes! GenomicConsensus tools automatically detect the chemistry
\f[I]per\-SMRT Cell\f[], so they can figure out the right parameters for
each read and model them appropriately.
.SH What chemistries and chemistry mixes are supported?
.PP
For Quiver: all PacBio RS chemistries are supported.
Chemistry mixtures of P6\-C4, P4\-C2, P5\-C3, and C2 are supported.
.PP
For Arrow: the RS chemistry P6\-C4, and all PacBio Sequel chemistries
are supported.
Mixes of these chemistries are supported.
.SH What are the QVs that the Quiver model uses?
.PP
Quiver uses additional QV tracks provided by the basecaller.
These QVs may be looked at as little breadcrumbs that are left behind by
the basecaller to help identify positions where it was likely that
errors of a given type occurred.
Formally, the QVs for a given read are vectors of the same length as the
number of bases called; the QVs used are as follows:
.RS
.IP \[bu] 2
DeletionQV
.IP \[bu] 2
InsertionQV
.IP \[bu] 2
MergeQV
.IP \[bu] 2
SubstitutionQV
.IP \[bu] 2
DeletionTag
.RE
.PP
To find out if your cmp.h5 file is loaded with these QV tracks, run the
command:
.IP
.nf
\f[C]
%\ h5ls\ \-rv\ aligned_reads.cmp.h5
\f[]
.fi
.PP
and look for the QV track names in the output.
If your cmp.h5 file is lacking some of these tracks, Quiver will still
run, though it will issue a warning that its performance will be
suboptimal.
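.PP
Scanning the listing by eye can be replaced with a small check.
The Python sketch below assumes the \f[C]h5ls\f[] output has been
captured as text and simply searches it for each track name; the
function name and the dataset paths in the sample listing are
hypothetical, not the actual cmp.h5 layout.
.IP
.nf
\f[C]
```python
REQUIRED_TRACKS = [
    "DeletionQV",
    "InsertionQV",
    "MergeQV",
    "SubstitutionQV",
    "DeletionTag",
]


def missing_tracks(h5ls_output):
    """Return the QV track names that never appear in the listing."""
    return [t for t in REQUIRED_TRACKS if t not in h5ls_output]


# Hypothetical listing text; real h5ls output will look different.
sample = """
/AlnGroup/DeletionQV      Dataset {1024}
/AlnGroup/InsertionQV     Dataset {1024}
/AlnGroup/SubstitutionQV  Dataset {1024}
"""

print(missing_tracks(sample))
```
\f[]
.fi
.PP
An empty result means every track name was found in the listing.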
.SH Why is quiver/arrow making errors in some region?
.PP
The most likely cause for \f[I]true\f[] errors made by these tools is
that the coverage in the region was low.
If there is only 5x coverage over a 1000\-base region, then about 10
errors in that region can be expected (an error rate of 1 in 100, or
roughly Q20).
.PP
It is important to understand that the effective coverage available to
quiver/arrow is not the full coverage apparent in plots\[em]the tools
filter out ambiguously mapped reads by default.
The remaining coverage after filtering is called the \f[I]effective
coverage\f[].
See the section on MapQV below for further discussion.
.PP
If you have verified that there is high effective coverage in the region
in question, it is quite possible, given the high accuracy quiver
and arrow can achieve, that the apparent errors actually reflect true
sequence variants.
Inspect the FASTQ output file to ensure that the region was called at
high confidence; if an erroneous sequence variant is being called at
high confidence, please report a bug to us.
.SH What does Quiver do for genomic regions with no effective coverage?
.PP
For regions with no effective coverage, no variants are output, and
the FASTQ confidence is 0.
.PP
The output in the FASTA and FASTQ consensus sequence tracks is dependent
on the setting of the \f[C]\-\-noEvidenceConsensusCall\f[] flag.
Assuming the reference in the window is \[lq]ACGT\[rq], the options are:
.PP
.TS
tab(@);
lw(45.7n) lw(10.7n).
T{
\f[C]\-\-noEvidenceConsensusCall=...\f[]
T}@T{
Consensus output
T}
_
T{
\f[C]nocall\f[] (default in 1.4)
T}@T{
NNNN
T}
T{
\f[C]reference\f[]
T}@T{
ACGT
T}
T{
\f[C]lowercasereference\f[] (new post 1.4, and the default)
T}@T{
acgt
T}
.TE
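.PP
The table can be restated as a small Python function.
This is a sketch of the documented behavior only; the function name is
hypothetical and it is not the code GenomicConsensus runs.
.IP
.nf
\f[C]
```python
def no_evidence_consensus(reference_window, mode="lowercasereference"):
    """Consensus emitted for a window with no effective coverage."""
    if mode == "nocall":
        return "N" * len(reference_window)
    if mode == "reference":
        return reference_window
    if mode == "lowercasereference":
        return reference_window.lower()
    raise ValueError("unknown mode: " + mode)


for mode in ("nocall", "reference", "lowercasereference"):
    print(mode, no_evidence_consensus("ACGT", mode))
```
\f[]
.fi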
.SH What is MapQV and why is it important?
.PP
MapQV is a single scalar Phred\-scaled QV per aligned read that reflects
the mapper's degree of certainty that the read aligned to \f[I]this\f[]
part of the reference and not some other.
Unambiguously mapped reads will have a high MapQV (typically 255), while
a read that was equally likely to have come from two parts of the
reference would have a MapQV of 3.
.PP
MapQV is important when you want highly accurate variant calls.
Quiver and Plurality both filter out aligned reads with a MapQV below 20
(by default), so as not to call a variant using data of uncertain
genomic origin.
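.PP
The Phred scaling and the default filter can be illustrated with a short
Python sketch.
The function names and the read tuples are hypothetical; the sketch only
mirrors the description above and is not how the mapper or
GenomicConsensus is implemented.
.IP
.nf
\f[C]
```python
import math


def map_qv(p_wrong):
    """Phred-scale the probability that a read is mapped wrongly."""
    return round(-10 * math.log10(p_wrong))


def filter_by_map_qv(reads, threshold=20):
    """Keep (name, MapQV) pairs that are not below the threshold."""
    return [(name, qv) for name, qv in reads if qv >= threshold]


reads = [
    ("unique", 255),
    # Equally likely to come from either of two repeat copies.
    ("two_copy_repeat", map_qv(0.5)),
    ("borderline", 20),
]

kept = filter_by_map_qv(reads)
unfiltered = filter_by_map_qv(reads, threshold=0)
```
\f[]
.fi
.PP
A read with a one\-in\-two chance of being misplaced scores 3 and is
dropped by the default threshold of 20; a threshold of 0 keeps every
read.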
.PP
This filtering can be problematic when using quiver/arrow to get a
consensus sequence.
If the genome of interest contains long (relative to the library insert
size) highly\-similar repeats, the effective coverage (after MapQV
filtering) may be reduced in the repeat regions; these reductions are
termed MapQV dropouts.
If the coverage is sufficiently reduced in these regions, quiver/arrow
will not call consensus there (see the section \[lq]What does Quiver do
for genomic regions with no effective coverage?\[rq]).
.PP
If you want to use ambiguously mapped reads in computing a consensus for
a de novo assembly, the MapQV filter can be turned off entirely.
In this case, the consensus for each instance of a genomic repeat will
be calculated using reads that may actually be from other instances of
the repeat, so the exact trustworthiness of the consensus in that region
may be suspect.
The next section describes how to disable the MapQV filter.
.SH How can the MapQV filter be turned off and when should it be?
.PP
The MapQV filter can be disabled using the flag
\f[C]\-\-mapQvThreshold=0\f[] (shorthand: \f[C]\-m=0\f[]).
If running a quiver/arrow job via SMRT Portal, this can be done by
unchecking the \[lq]Use only unambiguously mapped reads\[rq] option.
Consider this in de novo assembly projects, but it is not recommended
for variant calling applications.
.SH How can variant calls made by quiver/arrow be inspected or validated?
.PP
When in doubt, it is easiest to inspect the region in a tool like SMRT
View, which enables you to view the reads aligned to the region.
Deletions and substitutions should be fairly easy to spot; to view
insertions, right\-click on the reference base and select \[lq]View
Insertions Before\&...\[rq].
.SH What are the filtering parameters that quiver/arrow use?
.PP
The available options limit read coverage, filter reads by MapQV, and
filter variants by quality and coverage.
.IP \[bu] 2
The overall read coverage used to call consensus in every window is 100x
by default, but can be changed using \f[C]\-X=value\f[].
.IP \[bu] 2
The MapQV filter, by default, removes reads with MapQV < 20.
This is configured using \f[C]\-\-mapQvThreshold=value\f[] /
\f[C]\-m=value\f[].
.IP \[bu] 2
Variants are only called if the read coverage of the site exceeds 5x, by
default\[em]this is configurable using \f[C]\-x=value\f[].
Further, they will not be called if the confidence (Phred\-scaled) does
not exceed 40\[em]configurable using \f[C]\-q=value\f[].
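.PP
How these defaults combine can be sketched in Python.
Everything here is an assumption made for illustration: the function and
parameter names are hypothetical, the MapQV filter is assumed to run
before the coverage cap, and the coverage and confidence thresholds are
treated as inclusive minimums, which the text above does not settle.
.IP
.nf
\f[C]
```python
def effective_coverage(map_qvs, map_qv_threshold=20, max_coverage=100):
    """Count reads passing the MapQV filter, capped at max_coverage."""
    passing = sum(1 for qv in map_qvs if qv >= map_qv_threshold)
    return min(passing, max_coverage)


def variant_is_reported(coverage, confidence,
                        min_coverage=5, min_confidence=40):
    """Apply the coverage and confidence filters to one variant."""
    # Assumption: both bounds are inclusive.
    return coverage >= min_coverage and confidence >= min_confidence


# 150 well-mapped reads plus 10 ambiguous ones over a single window.
window = [255] * 150 + [3] * 10
coverage = effective_coverage(window)
reported = variant_is_reported(coverage, confidence=45)
```
\f[]
.fi
.PP
In this example the ambiguous reads are discarded, the remaining reads
are capped at the default of 100, and a variant with confidence 45
passes both filters.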
.SH What happens when the sample is a mixture, or diploid?
.PP
At present, quiver/arrow assume a haploid sample, and their behavior on
sample mixtures or diploid/polyploid samples is \f[I]undefined\f[].
The program will not crash, but the output is not guaranteed to match
any single haplotype in the sample; it may instead be a patchwork of
several.
.SH Why would I want to \f[I]iterate\f[] the mapping+(quiver/arrow) process?
.PP
Some customers using quiver for polishing highly repetitive genomes have
found that if they take the consensus FASTA output of quiver, use it as
a new reference, and then perform mapping and Quiver again to get a new
consensus, they get improved results from the second round of quiver.
.PP
This can be explained by noting that the output of the first round of
quiver is more accurate than the initial draft consensus output by the
assembler, so the second round's mapping to the quiver consensus can be
more sensitive in mapping reads from repetitive regions.
This can then result in improved consensus in those repetitive regions,
because the reads have been assigned more correctly to their true
genomic loci.
However, there is also a possibility that the shifting of reads from one
round's mapping to the next might alter borderline (low confidence)
consensus calls even away from repetitive regions.
.PP
We recommend the (mapping+quiver) iteration for customers polishing
repetitive genomes, and it could also prove useful for resequencing
applications.
However, we caution that this is very much an \f[I]exploratory\f[]
procedure and we make no guarantees about its performance.
In particular, borderline consensus calls can change when the procedure
is iterated, and the procedure is \f[I]not\f[] guaranteed to be
convergent.
.SH Is iterating the (mapping+quiver/arrow) process a convergent procedure?
.PP
We have seen many examples where (mapping+quiver), repeated many times,
is evidently \f[I]not\f[] a convergent procedure.
For example, a variant call may be present in iteration n, absent in
n+1, and then present again in n+2.
It is possible for subtle changes in mapping to change the set of reads
examined upon inspecting a genomic window, and therefore result in a
different consensus sequence there.
We expect this to be the case primarily for \[lq]borderline\[rq] (low
confidence) base calls.
.SH SEE ALSO
.PP
\f[B]variantCaller\f[](1)