NAME¶
theseus - Maximum likelihood, multiple simultaneous superpositions with
statistical analysis
SYNOPSIS¶
theseus [options]
pdbfile1 [pdbfile2 ...]
and
theseus_align [options] -f
pdbfile1 [pdbfile2 ...]
DESCRIPTION¶
Theseus superposes a set of macromolecular structures simultaneously
using the method of maximum likelihood (ML), rather than the conventional
least-squares criterion.
Theseus assumes that the structures are
distributed according to a matrix Gaussian distribution and that the
eigenvalues of the atomic covariance matrix are hierarchically distributed
according to an inverse gamma distribution. This ML superpositioning model
produces much more accurate results by essentially downweighting variable
regions of the structures and by correcting for correlations among atoms.
Theseus operates in two main modes: (1) a mode for superimposing
structures with identical sequences and (2) a mode for structures with
different sequences but similar structures:
- (1) A mode for superpositioning macromolecules with identical sequences
and numbers of residues, for instance, multiple models in an NMR family or
multiple structures from different crystal forms of the same protein.
- In this mode, Theseus will read every model in every file on the
command line and superpose them.
- Example:
- theseus 1s40.pdb
- In the above example, 1s40.pdb is a pdb file of 10 NMR models.
- (2) An ``alignment'' mode for superpositioning structures with different
sequences, for example, multiple structures of the cytochrome c protein
from different species or multiple mutated structures of hen egg white
lysozyme.
- This mode requires the user to supply a sequence alignment file of the
structures being superpositioned (see option -A and ``FILE
FORMATS'' below). Additionally, it may be necessary to supply a mapfile
that tells theseus which PDB structure files correspond to which
sequences in the alignment (see option -M and ``FILE FORMATS''
below). The mapfile is unnecessary if the sequence names and corresponding
pdb filenames are identical. In this mode, if there are multiple
structural models in a PDB file, theseus only reads the first model
in each file on the command line. In other words, theseus treats
the files on the command line as if there were only one structure per
file.
- Example 1:
- theseus -A cytc.aln -M cytc.filemap d1cih__.pdb d1csu__.pdb
d1kyow_.pdb
- In the above example, d1cih__.pdb, d1csu__.pdb, and d1kyow_.pdb are pdb
files of cytochrome c domains from the SCOP database.
- Example 2:
- theseus_align -f d1cih__.pdb d1csu__.pdb d1kyow_.pdb
- In this example, the theseus_align script is called to do the hard
work for you. It will calculate a sequence alignment and then superpose
based on that alignment. The script theseus_align takes the same
options as the theseus program. Note, the first few lines of this
script must be modified for your system, since it calls an external
multiple sequence alignment program to do the alignment. See the
examples/ directory for more details, including example files.
OPTIONS¶
Algorithmic options, defaults in {brackets}:¶
- --amber
- Do special processing for AMBER8 formatted PDB files
- Most people will never need to use this long option, unless you are
processing MD traces from AMBER. AMBER puts the atom names in the wrong
column in the PDB file.
- -a [selection]
- Atoms to include in the superposition. This option takes two types of
arguments, either (1) a number specifying a preselected set of atom types,
or (2) an explict PDB-style, colon-delimited list of the atoms to
include.
- For the preselected atom type subsets, the following integer options are
available:
- •
- 0, alpha carbons for proteins, C1´ atoms for nucleic acids
- •
- 1, backbone
- •
- 2, all
- •
- 3, alpha and beta carbons
- •
- 4, all heavy atoms (no hydrogens)
- Note, only the -a0 option is available when superpositioning
structures with different sequences.
- To custom select an explicit set of atom types, the atom types must be
specified exactly as given in the PDB file field, including spaces, and
the atom-types must encapsulated in quotation marks. Multiple atom types
must be delimited by a colon. For example,
- -a ` N : CA : C : O '
- would specify the atom types in the peptide backbone.
- -f
- Only read the first model of a multi-model PDB file
- -h
- Help/usage
- -i [nnn]
- Maximum iterations, {200}
- -p [precision]
- Requested relative precision for convergence, {1e-7}
- -r [root name]
- Root name to be used in naming the output files, {theseus}
- -s [n-n:...]
- Residue selection (e.g. -s15-45:50-55), {all}
- -S [n-n:...]
- Residues to exclude (e.g. -S15-45:50-55) {none}
- The previous two options have the same format. Residue (or alignment
column) ranges are indicated by beginning and end separated by a dash.
Multiple ranges, in any arbitrary order, are separated by a colon. Chains
may also be selected by giving the chain ID immediately preceding the
residue range. For example, -sA1-20:A40-71 will only include
residues 1 through 20 and 40 through 70 in chain A. Chains cannot be
specified when superposing structures with different sequences.
- -v
- use ML variance weighting (no correlations) {default}
- -A [sequence alignment file]
- Sequence alignment file to use as a guide (CLUSTAL or A2M format)
- For use when superposing structures with different sequences. See ``FILE
FORMATS'' below.
- -E
- Print expert options
- -F
- Print FASTA files of the sequences in PDB files and quit
- A useful option when superposing structures with different sequences. The
files output with this option can be aligned with a multiple sequence
alignment program such as CLUSTAL or MUSCLE, and the resulting output
alignment file used as theseus input with the -A option.
- -h
- Help/usage
- -I
- Just calculate statistics for input file; don't superpose
- -M [mapfile]
- File that maps PDB files to sequences in the alignment.
- A simple two-column formatted file; see ``FILE FORMATS'' below. Used with
mode 2.
- -n
- Don't write transformed pdb file
- -o [reference structure]
- Reference file to superpose on, all rotations are relative to the first
model in this file
- For example, 'theseus -o cytc1.pdb cytc1.pdb cytc2.pdb cytc3.pdb' will
superpose the structures and rotate the entire final superposition so that
the structure from cytc1.pdb is in the same orientation as the structure
in the original cytc1.pdb PDB file.
- -V
- Version
Principal components analysis:¶
- -C
- Use covariance matrix for PCA (correlation matrix is default)
- -P [nnn]
- Number of principal components to calculate {0}
- In both of the above, the corresponding principal component is written in
the B-factor field of the output PDB file. Usually only the first few PCs
are of any interest (maybe up to six).
EXAMPLES theseus 2sdf.pdb
theseus -l -r new2sdf
2sdf.pdb
theseus -s15-45 -P3
2sdf.pdb
theseus -A
cytc.aln -M
cytc.mapfile -o
cytc1.pdb
-s1-40
cytc1.pdb cytc2.pdb cytc3.pdb cytc4.pdb
ENVIRONMENT¶
You can set the environment variable 'PDBDIR' to your PDB file directory and
theseus will look there after the present working directory. For
example, in the C shell (tcsh or csh), you can put something akin to this in
your .cshrc file:
setenv PDBDIR '/usr/share/pdbs/'
Theseus will read standard PDB formatted files (see
<
http://www.rcsb.org/pdb/>). Every effort has been made for the program
to accept nonstandard CNS and X-PLOR file formats also.
Two other files deserve mention, a sequence alignment file and a mapfile.
Sequence alignment file¶
When superposing structures with different residue identities (where the lengths
of each the macromolecules in terms of residues are not necessarily equal), a
sequence alignment file must be included for
theseus to use as a guide
(specified by the
-A option).
Theseus accepts both CLUSTAL and
A2M (FASTA) formatted multiple sequence alignment files.
NOTE 1: The residue sequence in the alignment must match exactly the residue
sequence given in the coordinates of the PDB file. That is, there can be no
missing or extra residues that do not correspond to the sequence in the PDB
file. An easy way to ensure that your sequences exactly match the PDB files is
to generate the sequences using
theseus' -F option, which writes
out a FASTA formatted sequence file of the chain(s) in the PDB files. The
files output with this option can then be aligned with a multiple sequence
alignment program such as CLUSTAL or MUSCLE, and the resulting output
alignment file used as
theseus input with the
-A option.
NOTE 2: Every PDB file must have a corresponding sequence in the alignment.
However, not every sequence in the alignment needs to have a corresponding PDB
file. That is, there can be extra sequences in the alignment that are not used
for guiding the superposition.
PDB -> Sequence mapfile¶
If the names of the PDB files and the names of the corresponding sequences in
the alignemnt are identical, the mapfile may be omitted. Otherwise,
Theseus needs to know which sequences in the alignment file correspond
to which PDB structure files. This information is included in a mapfile with a
very simple format (specified with the
-M option). There are only two
columns separated by whitespace: the first column lists the names of the PDB
structure files, while the second column lists the corresponding sequence
names exactly as given in the multiple sequence alignment file.
An example of the mapfile:
cytc1.pdb seq1
cytc2.pdb seq2
cytc3.pdb seq3
SCREEN OUTPUT¶
Theseus provides output describing both the progress of the superposing and
several statistics for the final result:
- Classical LS pairwise <RMSD>:
- The conventional RMSD for the superposition, the average RMSD for all
pairwise combinations of structures in the ensemble.
- Least-squares <sigma>:
- The standard deviation for the superposition, based on the conventional
assumption of no correlation and equal variances. Basically equal to the
RMSD from the average structure.
- Maximum Likelihood <sigma>:
- The ML analog of the standard deviation for the superposition. When
assuming that the correlations are zero (a diagonal covariance matrix),
this is equal to the square root of the harmonic average of the variances
for each atom. In contrast, the ``Least-squares <sigma>'' given
above reports the square root of the arithmetic average of the variances.
The harmonic average is always less than the arithmetic average, and the
harmonic average downweights large values proportional to their magnitude.
This makes sense statistically, because when combining values one should
weight them by the reciprocal of their variance (which is in fact what the
ML superposing method does).
- Marginal Log Likelihood:
- The final marginal log likelihood of the superposition, assuming the
matrix Gaussian distribution of the structures and the hierarchical
inverse gamma distribution of the eigenvalues of the covariance matrix.
The marginal log likelihood is the likelihood with the covariance matrix
integrated out.
- AIC:
- The Akaike Information Criterion for the final superposition. This is an
important statistic in likelihood analysis and model selection theory. It
allows an objective comparison of multiple theoretical models with
different numbers of parameters. In this case, the higher the number the
better. There is a tradeoff between fit to the data and the number of
parameters being fit. Increasing the number of parameters in a model will
always give a better fit to the data, but it also increases the
uncertainty of the estimated values. The AIC criterion finds the best
combination by (1) maximizing the fit to the data while (2) minimizing the
uncertainty due to the number of parameters. In the superposition case,
one can compare the least squares superposition to the maximum likelihood
superposition. The method (or model) with the higher AIC is preferred. A
difference in the AIC of 2 or more is considered strong statistical
evidence for the better model.
- BIC:
- The Bayesian Information Criterion. Similar to the AIC, but with a
Bayesian emphasis.
- Omnibus chi2:
- The overall reduced chi2 statistic for the entire fit, including the
rotations, translations, covariances, and the inverse gamma parameters.
This is probably the most important statistic for the superposition. In
some cases, the inverse gamma fit may be poor, yet the overall fit is
still very good. Again, it should ideally be close to 1.0, which would
indicate a perfect fit. However, if you think it is too large, make sure
to compare it to the chi2 for the least-squares fit; it's probably not
that bad after all. A large chi2 often indicates a violation of the
assumptions of the model. The most common violation is when superposing
two or more independent domains that can rotate relative to each other. If
this is the case, then there will likely be not just one Gaussian
distribution, but several mixed Gaussians, one for each domain. Then, it
would be better to superpose each domain independently.
- Hierarchical var (alpha, gamma) chi2:
- The reduced chi2 for the inverse gamma fit of the covariance matrix
eigenvalues. As before, it should ideally be close to 1.0. The two values
in the parentheses are the ML estimates of the scale and shape parameters,
respectively, for the inverse gamma distribtuion.
- Rotational, translational, covar chi2:
- The reduced chi2 statistic for the fit of the structures to the model.
With a good fit it should be close to 1.0, which indicates a perfect fit
of the data to the statistical model. In the case of least-squares, the
assumed model is a matrix Gaussian distribution of the structures with
equal variances and no correlations. For the ML fits, the assumed model is
unequal variances and no correlations, as calculated with the -v
option [default]. This statistic is for the superposition only, and does
not include the fit of the covariance matrix eigenvalues to an inverse
gamma distribution. See ``Omnibus chi2'' below.
- Hierarchical minimum var:
- The hierarchical fit of the inverse gamma distribution constrains the
variances of the atoms by making large ones smaller and small ones larger.
This statistic reports the minimum possible variance given the inferred
inverse gamma parameters.
- skewness, skewness Z-value, kurtosis & kurtosis Z-value:
- The skewness and kurtosis of the residuals. Both should be 0.0 if the
residuals fit a Gaussian distribution perfectly. They are followed by the
P-value for the statistics. This is a very stringent test; residuals can
be very non-Gaussian and yet the estimated rotations, translations, and
covariance matrix may still be rather accurate.
- Data pts, Free params, D/P:
- The total number of data points given all observed structures, the number
of parameters being fit in the model, and the data-to-parameter ratio.
- Median structure:
- The structure that is overall most similar to the average structure. This
can be considered to be the most ``typical'' structure in the ensemble.
- Total rounds:
- The number of iterations that the algorithm took to converge.
- Fractional precision:
- The actual precision that the algorithm converged to.
OUTPUT FILES¶
Theseus writes out the following files:
- theseus_sup.pdb
- The final superposition, rotated to the principle axes of the mean
structure.
- theseus_ave.pdb
- The estimate of the mean structure.
- theseus_residuals.txt
- The normalized residuals of the superposition. These can be analyzed for
deviations from normality (whether they fit a standard Gaussian
distribution). E.g., the chi2, skewness, and kurtosis statistics are based
on these values.
- theseus_transf.txt
- The final transformation rotation matrices and translation vectors.
- theseus_variances.txt
- The vector of estimated variances for each atom.
When Principal Components are calculated (with the
-P option), the
following files are also produced:
- theseus_pcvecs.txt
- The principal component vectors.
- theseus_pcstats.txt
- Simple statistics for each principle component (loadings, variance
explained, etc.).
- theseus_pcN_ave.pdb
- The average structure with the Nth principal component written in the
temperature factor field.
- theseus_pcN.pdb
- The final superposition with the Nth principal component written in the
temperature factor field. This file is omitted when superposing molecules
with different residue sequences (mode 2).
- theseus_cor.mat, theseus_cov.mat
- The atomic correlation matrix and covariance matrices, based on the final
superposition. The format is suitable for input to GNU's octave.
These are the matrices used in the Principal Components Analysis.
BUGS¶
Please send me (DLT) reports of all problems.
RESTRICTIONS¶
Theseus is
not a structural alignment program. The structure-based
alignment problem is completely different from the structural superposition
problem. In order to do a structural superposition, there must be a 1-to-1
mapping that associates the atoms in one structure with the atoms in the other
structures. In the simplest case, this means that structures must have
equivalent numbers of atoms, such as the models in an NMR PDB file. For
structures with different numbers of residues/atoms, superposing is only
possible when the sequences have been aligned previously. Finding the best
sequence alignment based on only structural information is a difficult
problem, and one for which there is currently no maximum likelihood approach.
Extending
theseus to address the structural alignment problem is an
ongoing research project.
AUTHOR¶
Douglas L. Theobald
dtheobald@brandeis.edu
CITATION¶
When using
theseus in publications please cite:
Douglas L. Theobaldand Phillip A. Steindel (2012)
``Optimal simultaneous superpositioning of multiple structures with missing
data.''
Bioinformatics 28(15):1972-1979
The following papers also report
theseus developments:
Douglas L. Theobald and Deborah S. Wuttke (2008)
``Accurate structural correlations from maximum likelihood superpositions.''
PLoS Computational Biology 4(2):e43
Douglas L. Theobald and Deborah S. Wuttke (2006)
``THESEUS: Maximum likelihood superpositioning and analysis of macromolecular
structures."
Bioinformatics 22(17):2171-2172
Douglas L. Theobald and Deborah S. Wuttke (2006)
``Empirical Bayes models for regularizing maximum likelihood estimation in the
matrix Gaussian Procrustes problem.''
PNAS 103(49):18521-18527
HISTORY¶
Long, tedious, and sordid.