NAME¶
ipdSummary - Detect DNA base-modifications from kinetic signatures.
DESCRIPTION¶
kineticsTool loads IPDs observed at each position in the genome, and compares
those IPDs to value expected for unmodified DNA, and outputs the result of
this statistical test. The expected IPD value for unmodified DNA can come from
either an
in-silico control or an
amplified control. The in
silico control is trained by PacBio and shipped with the package. It predicts
predicts the IPD using the local sequence context around the current position.
An amplified control dataset is generated by sequencing unmodified DNA with
the same sequence as the test sample. An amplified control sample is usually
generated by whole-genome amplification of the original sample.
Modification Detection¶
The basic mode of kineticsTools does an independent comparison of IPDs at each
position on the genome, for each strand, and emits various statistics to CSV
and GFF (after applying a significance filter).
Modifications Identification¶
- kineticsTools also has a Modification Identification mode that
can decode multi-site IPD 'fingerprints' into a reduced set of calls of
specific modifications. This feature has the following benefits:
- •
- Different modifications occuring on the same base can be distinguished
(for example m5C and m4C)
- •
- The signal from one modification is combined into one statistic, improving
sensitivity, removing extra peaks, and correctly centering the call
OPTIONS¶
Please call this program with
--help to see the available options.
ALGORITHM¶
Synthetic Control¶
Studies of the relationship between IPD and sequence context reveal that most of
the variation in mean IPD across a genome can be predicted from a 12-base
sequence context surrounding the active site of the DNA polymerase. The bounds
of the relevant context window correspond to the window of DNA in contact with
the polymerase, as seen in DNA/polymerase crystal structures. To simplify the
process of finding DNA modifications with PacBio data, the tool includes a
pre-trained lookup table mapping 12-mer DNA sequences to mean IPDs observed in
C2 chemistry.
Filtering and Trimming¶
kineticsTools uses the Mapping QV generated by BLASR and stored in the cmp.h5
file to ignore reads that aren't confidently mapped. The default minimum
Mapping QV required is 10, implying that BLASR has
90\% confidence that
the read is correctly mapped. Because of the range of readlengths inherent in
PacBio dataThis can be changed in using the --mapQvThreshold command line
argument, or via the SMRTPortal configuration dialog for Modification
Detection.
There are a few features of PacBio data that require special attention in order
to acheive good modification detection performance. kineticsTools inspects the
alignment between the observed bases and the reference sequence -- in order
for an IPD measurement to be included in the analysis, the PacBio read
sequence must match the reference sequence for
k around the cognate
base. In the current module
k=1 The IPD distribution at some locus be
thought of as a mixture between the 'normal' incorporation process IPD, which
is sensitive to the local sequence context and DNA modifications and a
contaminating 'pause' process IPD which have a much longer duration (mean
>10x longer than normal), but happen rarely (~1% of IPDs). Note: Our
current understanding is that pauses do not carry useful information about the
methylation state of the DNA, however a more careful analysis may be
warranted. Also note that modifications that drastically increase the Roughly
1% of observed IPDs are generated by pause events. Capping observed IPDs at
the global 99th percentile is motivated by theory from robust hypothesis
testing. Some sequence contexts may have naturally longer IPDs, to avoid
capping too much data at those contexts, the cap threshold is adjusted per
context as follows: capThreshold = max(global99, 5*modelPrediction,
percentile(ipdObservations, 75))
Statistical Testing¶
We test the hypothesis that IPDs observed at a particular locus in the sample
have a longer means than IPDs observed at the same locus in unmodified DNA. If
we have generated a Whole Genome Amplified dataset, which removes DNA
modifications, we use a case-control, two-sample t-test. This tool also
provides a pre-calibrated 'synthetic control' model which predicts the
unmodified IPD, given a 12 base sequence context. In the synthetic control
case we use a one-sample t-test, with an adjustment to account for error in
the synthetic control model.
aligned_reads.cmp.h5¶
A standard cmp.h5 file contain alignments and IPD information supplies the
kinetic data used to perform modification detection. The standard cmp.h5 file
of a SMRTportal jobs is data/aligned_read.cmp.h5.
Reference Sequence¶
The tool requires the reference sequence used to perform alignments. Currently
this must be supplied via the path to a SMRTportal reference repository entry.
OUTPUTS¶
The modification detection tool provides results in a variety of formats
suitable for in-depth statistical analysis, quick reference, and comsumption
by visualization tools such as PacBio SMRTView. Results are generally indexed
by reference position and reference strand. In all cases the strand value
refers to the strand carrying the modification in DNA sample. Remember that
the kinetic effect of the modification is observed in read sequences aligning
to the opposite strand. So reads aligning to the positive strand carry
information about modification on the negative strand and vice versa, but in
this toolkit we alway report the strand containing the putative modification.
modifications.csv¶
The modifications.csv file contains one row for each (reference position,
strand) pair that appeared in the dataset with coverage at least x. x defaults
to 3, but is configurable with '--minCoverage' flag to ipdSummary.py. The
reference position index is 1-based for compatibility with the gff file the R
environment.
Output columns¶
in-silico control mode
|
Column |
Description |
|
refId |
reference sequence ID of this observation |
|
tpl |
1-based template position |
|
strand |
native sample strand where kinetics were generated. '0' is the strand of
the original FASTA, '1' is opposite strand from FASTA |
|
base |
the cognate base at this position in the reference |
|
score |
Phred-transformed pvalue that a kinetic deviation exists at this
position |
|
tMean |
capped mean of normalized IPDs observed at this position |
|
tErr |
capped standard error of normalized IPDs observed at this position
(standard deviation / sqrt(coverage) |
|
modelPrediction |
normalized mean IPD predicted by the synthetic control model for this
sequence context |
|
ipdRatio |
tMean / modelPrediction |
|
coverage |
count of valid IPDs at this position (see Filtering section for
details) |
|
frac |
estimate of the fraction of molecules that carry the modification |
|
fracLow |
2.5% confidence bound of frac estimate |
|
fracUpp |
97.5% confidence bound of frac estimate |
|
case-control mode
|
Column |
Description |
|
refId |
reference sequence ID of this observation |
|
tpl |
1-based template position |
|
strand |
native sample strand where kinetics were generated. '0' is the strand of
the original FASTA, '1' is opposite strand from FASTA |
|
base |
the cognate base at this position in the reference |
|
score |
Phred-transformed pvalue that a kinetic deviation exists at this
position |
|
caseMean |
mean of normalized case IPDs observed at this position |
|
controlMean |
mean of normalized control IPDs observed at this position |
|
caseStd |
standard deviation of case IPDs observed at this position |
|
controlStd |
standard deviation of control IPDs observed at this position |
|
ipdRatio |
tMean / modelPrediction |
|
testStatistic |
t-test statistic |
|
coverage |
mean of case and control coverage |
|
controlCoverage |
count of valid control IPDs at this position (see Filtering section for
details) |
|
caseCoverage |
count of valid case IPDs at this position (see Filtering section for
details) |
|
modifications.gff¶
The modifications.gff is compliant with the GFF Version 3 specification (
http://www.sequenceontology.org/gff3.shtml). Each template position /
strand pair whose p-value exceeds the pvalue threshold appears as a row. The
template position is 1-based, per the GFF spec. The strand column refers to
the strand carrying the detected modification, which is the opposite strand
from those used to detect the modification. The GFF confidence column is a
Phred-transformed pvalue of detection.
Note on genome browser compatibility
The modifications.gff file will not work directly with most genome browsers. You
will likely need to make a copy of the GFF file and convert the _seqid_
columns from the generic 'ref0000x' names generated by PacBio, to the FASTA
headers present in the original reference FASTA file. The mapping table is
written in the header of the modifications.gff file in
#sequence-header
tags. This issue will be resolved in the 1.4 release of kineticsTools.
The auxiliary data column of the GFF file contains other statistics which may be
useful downstream analysis or filtering. In particular the coverage level of
the reads used to make the call, and +/- 20bp sequence context surrounding the
site.
|
Column |
Description |
|
seqid |
Fasta contig name |
|
source |
Name of tool -- 'kinModCall' |
|
type |
Modification type -- in identification mode this will be m6A, m4C, or
m5C for identified bases, or the generic tag 'modified_base' if a kinetic
event was detected that does not match a known modification signature |
|
start |
Modification position on contig |
|
end |
Modification position on contig |
|
score |
Phred transformed p-value of detection - this is the single-site
detection p-value |
|
strand |
Sample strand containing modification |
|
phase |
Not applicable |
|
attributes |
Extra fields relevant to base mods. IPDRatio is traditional IPDRatio,
context is the reference sequence -20bp to +20bp around the modification,
and coverage level is the number of IPD observations used after Mapping QV
filtering and accuracy filtering. If the row results from an identified
modification we also include an identificationQv tag with the from the
modification identification procedure. identificationQv is the
phred-transformed probability of an incorrect identification, for bases
that were identified as having a particular modification. frac, fracLow,
fracUp are the estimated fraction of molecules carrying the modification,
and the 5% confidence intervals of the estimate. The methylated fraction
estimation is a beta-level feature, and should only be used for
exploratory purposes. |
|
motifs.gff¶
If the Motif Finder tool is run, it will generate motifs.gff, which a
reprocessed version of modifications.gff with the following changes. If a
detected modification occurs on a motif detected by the motif finder, the
modification is annotated with motif data. An attribute 'motif' is added
containing the motif string, and an attribute 'id' is added containing the
motif id, which is the motif string for unpaired motifs or
'motifString1/motifString2' for paired motifs. If a motif instance exists in
the genome, but was not detected in modifications.gff, an entry is added to
motifs.gff, indicating the presence of that motif and the kinetics that were
observed at that site.
motif_summary.csv¶
If the Motif Finder tool is run, motif_summary.csv is generated, summarizing the
modified motifs discovered by the tool. The CSV contains one row per detected
motif, with the following columns
|
Column |
Description |
|
motifString |
Detected motif sequence |
|
centerPos |
Position in motif of modification (0-based) |
|
fraction |
Fraction of instances of this motif with modification QV above the QV
threshold |
|
nDetected |
Number of instances of this motif with above threshold |
|
nGenome |
Number of instances of this motif in reference sequence |
|
groupTag |
A string idetifying the motif grouping. For paired motifs this is
"<motifString1>/<motifString2>", For unpaired motifs
this equals motifString |
|
partnerMotifString |
motifString of paired motif (motif with reverse-complementary
motifString) |
|
meanScore |
Mean Modification Qv of detected instances |
|
meanIpdRatio |
Mean IPD ratio of detected instances |
|
meanCoverage |
Mean coverage of detected instances |
|
objectiveScore |
Objective score of this motif in the motif finder algorithm |
|
SEE ALSO¶
summarizeModifications(1)