.TH MSA_VIEW "1" "May 2016" "msa_view 1.4" "User Commands"
.SH NAME
msa_view \- provide various kinds of "views" of one or more multiple alignments
.SH DESCRIPTION
Provides various kinds of "views" of one or more multiple
alignments.  Can extract a sub\-alignment from an alignment (by row
or by column) or combine several alignments into one.  Also can
extract the sufficient statistics for phylogenetic analysis from
an alignment, optionally accounting for site categories that are
defined by an auxiliary annotations file.  Supports various other
functions, including gap stripping, column randomization, and
reordering of sequences.  Capable of reading and writing in a few
common formats.  Can be used for file conversion (by default,
output is the entire input alignment).
.SH EXAMPLES
.PP
(See below for more details on options)
.PP
1. Convert alignment formats (output defaults to FASTA; input format is
guessed from file contents unless \fB\-\-in\-format\fR is given)
.IP
msa_view myfile.fa \fB\-\-out\-format\fR PHYLIP > myfile.ph
.IP
msa_view myfile2.raw \fB\-\-in\-format\fR MPM > myfile2.fa
.PP
2. Obtain a sub\-alignment by position, using the coordinate frame
of the first sequence in the alignment.
.IP
msa_view myfile.fa \fB\-\-start\fR 1234 \fB\-\-end\fR 5678 \fB\-\-refidx\fR 1 > mysub.fa
.PP
3. Obtain a sub\-alignment by sequence.
.IP
msa_view myfile.fa \fB\-\-seqs\fR 1,4,5 > seqs145.fa
.IP
msa_view myfile.fa \fB\-\-seqs\fR 1,4,5 \fB\-\-exclude\fR > seqs236.fa
.PP
(can also specify sequences by name, e.g., \fB\-\-seqs\fR cow,rat,pig)
.PP
4. Concatenate alignments.
.IP
msa_view \fB\-\-aggregate\fR human,mouse,rat myf1.fa myf2.fa myf3.fa > concat.fa
.PP
(source alignments may have different subsets of sequences and may
use different sequence orders; here, human,mouse,rat defines full
set and order in output alignment)
.PP
5. Extract sufficient statistics from a FASTA file.
.IP
msa_view myfile.fa \fB\-\-out\-format\fR SS > myfile.ss
.PP
6. Extract sufficient statistics from a MAF file for a complete
human chromosome.  (Can be used by phyloFit.)
.IP
msa_view chr1.maf \fB\-\-out\-format\fR SS > chr1.ss
.PP
7. As in (6), but include information about regions of the
reference sequence not present in the MAF file, and include a
representation of the order in which alignment columns occur
(needed by programs such as phastCons or exoniphy).
.IP
msa_view chr1.maf \fB\-\-refseq\fR chr1.fa
\fB\-\-out\-format\fR SS > chr1.ordered.ss
.PP
8. As in (6), but collect statistics for pairs of adjacent sites
(can be used by phyloFit to estimate a dinucleotide model).
.IP
msa_view chr1.maf \fB\-\-out\-format\fR SS
\fB\-\-tuple\-size\fR 2 > chr1.pairs.ss
.PP
9. Pool sufficient statistics from several human chromosomes.
.IP
msa_view \fB\-\-aggregate\fR human,mouse,rat
\fB\-\-out\-format\fR SS chr1.ss chr2.ss chr3.ss > chr123.ss
.PP
10. Extract separate sufficient statistics for the three codon
positions, as defined by annotations in a GFF file.
.IP
msa_view chr22.maf \fB\-\-features\fR chr22.gff
\fB\-\-catmap\fR "NCATS = 3; CDS 1\-3" \fB\-\-out\-format\fR SS
> chr22.pos.ss
.PP
11. As in (10), but re\-orient genes on \- strand so that stats
reflect + strand.  Assume genes are defined by tag "transcript_id".
.IP
msa_view chr22.maf \fB\-\-features\fR chr22.gff
\fB\-\-catmap\fR "NCATS = 3; CDS 1\-3" \fB\-\-reverse\-groups\fR transcript_id
\fB\-\-out\-format\fR SS > chr22.pos.ss
.SH OPTIONS
.SS Obtaining sub\-alignments and combining alignments
.HP
\fB\-\-start\fR, \fB\-s\fR <start_col>
.IP
Starting column of sub\-alignment (indexing starts with 1).
Default is 1.  Note that coordinates use the frame of reference
of the entire alignment unless \fB\-\-refidx\fR 1 is specified.
.HP
\fB\-\-end\fR, \fB\-e\fR <end_col>
.IP
Ending column of sub\-alignment.  Default is length of
alignment.  Note that coordinates use the frame of reference
of the entire alignment unless \fB\-\-refidx\fR 1 is specified.
.HP
\fB\-\-seqs\fR, \fB\-l\fR <seq_list>
.IP
Comma\-separated list of sequences to include (default) or
exclude (if \fB\-\-exclude\fR).  Indicate by sequence number or name
(numbering starts with 1 and is evaluated *after* \fB\-\-order\fR is
applied).
.HP
\fB\-\-exclude\fR, \fB\-x\fR
.IP
Exclude rather than include specified sequences.
.HP
\fB\-\-refidx\fR, \fB\-r\fR <ref_seq>
.IP
Index of reference sequence for coordinates.  Use 0 to
indicate the coordinate system of the alignment as a whole
(this is the default).
.HP
\fB\-\-aggregate\fR, \fB\-A\fR <name_list>
.IP
(Not compatible with \fB\-\-start\fR or \fB\-\-end\fR) Create an aggregate
alignment from a set of alignment files, by concatenating
individual alignments.  If used with \fB\-\-out\-format\fR SS and
\fB\-\-unordered\-ss\fR, the aggregate alignment will never be created
explicitly (recommended for large data sets).  The argument
<name_list> must be a list of sequence names, including all
names in all specified alignments (missing sequences will be
replaced by rows of missing data).  The standard <msa_fname>
argument should be replaced with a list of (whitespace\-separated) file names.
.HP
\fB\-\-split\-all\fR, \fB\-X\fR <filename_root>
.IP
Split output alignment into separate FASTA files by species.
File naming convention is <filename_root>.<species>.fa.  If used with
\fB\-\-gap\-strip\fR, gap characters will be stripped from all output files.
In this case, '\-\-gap\-strip <s>' should NOT be used (ALL and ANY
should both work fine).
.SS File formats, gap stripping, reordering, etc.
.HP
\fB\-\-in\-format\fR, \fB\-i\fR PHYLIP|FASTA|MPM|MAF|SS
.IP
Input file format.  (Default is to guess format from file contents.)
FASTA is as usual.  PHYLIP is compatible with the formats
used in the PHYLIP and PAML packages.  MPM is the format used by the
MultiPipMaker aligner and some of Webb Miller's other older tools.
MAF ("Multiple Alignment Format") is used by MULTIZ/TBA and the
UCSC Genome Browser.  SS is a simple format describing the
sufficient statistics for phylogenetic inference (distinct columns
or tuples of columns and their counts).  Use \fB\-\-out\-format\fR SS with
\fB\-\-in\-format\fR MAF for best efficiency (explicit alignment is
never created).  Also, use \fB\-\-unordered\-ss\fR if possible.
.HP
\fB\-\-out\-format\fR, \fB\-o\fR PHYLIP|FASTA|MPM|SS
.IP
Output file format (default FASTA).
.HP
\fB\-\-alphabet\fR, \fB\-a\fR <alphabet_string>
.IP
Use the specified alphabet (default "ACGT").  In addition,
\&'\-' characters are assumed to represent alignment gaps, and
\&'*' and 'N' characters are allowed for missing data.
Alphabetical letters not in the alphabet will be converted to
\&'N's upon input.  This option is ignored with SS input (the alphabet
is specified within SS files).
.HP
\fB\-\-soft\-masked\fR, \fB\-f\fR
.IP
Implies \fB\-\-alphabet\fR 'ACGTNacgtn', useful for soft\-masked sequences.
.HP
\fB\-\-unmask\fR, \fB\-u\fR
.IP
Remove soft\-masking; convert to uppercase.
.HP
\fB\-\-pretty\fR, \fB\-P\fR
.IP
Pretty\-print alignment (use '.' when character matches
corresponding character in first sequence).  Ignored if
\fB\-\-out\-format\fR SS is selected.
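.IP
The substitution can be pictured with the Python sketch below.  It is an
illustration only, not msa_view code: the function name pretty_rows is
hypothetical, and whether msa_view also dots out matching gap characters is
an assumption made here for simplicity.
.IP
.nf
```python
def pretty_rows(rows):
    """Replace characters that match the first sequence with a dot."""
    if not rows:
        return []
    first = rows[0]
    out = [first]  # the first sequence is printed unchanged
    for row in rows[1:]:
        out.append("".join("." if c == ref else c
                           for c, ref in zip(row, first)))
    return out

pretty = pretty_rows(["ACGT", "ACCT", "A-GT"])
```
.fi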
.HP
\fB\-\-gap\-strip\fR, \fB\-G\fR ALL|ANY|<s>
.IP
Strip columns containing all gaps, any gaps, or a gap in the
specified sequence (<s>).  Indexing starts at one and refers
to the list *after* any sequences have been added or
subtracted (via \fB\-\-seqs\fR and \fB\-\-exclude\fR or \fB\-\-order\fR).
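.IP
The three modes differ only in the test applied to each column.  A minimal
Python sketch of that test follows; it is not msa_view code, the name
gap_strip is hypothetical, and only '\-' is treated as a gap here.
.IP
.nf
```python
def gap_strip(rows, mode):
    """Drop alignment columns according to mode.

    mode: "ALL", "ANY", or a 1-based sequence index (int).
    """
    def keep(col):
        if mode == "ALL":   # drop columns made up entirely of gaps
            return any(c != "-" for c in col)
        if mode == "ANY":   # drop columns containing at least one gap
            return all(c != "-" for c in col)
        return col[mode - 1] != "-"   # drop gaps in sequence number mode

    kept = [col for col in zip(*rows) if keep(col)]
    if not kept:
        return ["" for _ in rows]
    # Transpose the surviving columns back into rows.
    return ["".join(chars) for chars in zip(*kept)]

aln = ["A-C-", "AG--", "A-T-"]
stripped_all = gap_strip(aln, "ALL")
stripped_any = gap_strip(aln, "ANY")
stripped_seq2 = gap_strip(aln, 2)
```
.fi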
.HP
\fB\-\-collapse\-missing\fR, \fB\-p\fR
.IP
(For use with \fB\-o\fR SS) Convert all missing\-data characters and
gaps to "*" characters.  Can be used to make sufficient
statistics more compact, which can improve the performance of
phyloFit (all missing data and gap characters are typically
treated the same by phyloFit anyway).
.HP
\fB\-\-mark\-missing\fR, \fB\-K\fR <maxlen>
.IP
Convert all gaps of length greater than <maxlen> to "*"
characters.  If \fB\-\-refidx\fR is specified (with a positive index),
gaps in the designated reference sequence will not be altered.
This is a useful heuristic for distinguishing between
microindels and regions of missing data (e.g., due to
large\-scale indels, incomplete assemblies, or highly
diverged sequences).
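.IP
The length threshold can be sketched for a single row in Python as below.
This is an illustration under stated assumptions, not msa_view code: the name
mark_missing is hypothetical, each row is handled independently, and the
exception for the reference sequence is left out.
.IP
.nf
```python
import re

def mark_missing(seq, maxlen):
    """Replace each gap run longer than maxlen with "*" characters."""
    def convert(match):
        run = match.group()
        return "*" * len(run) if len(run) > maxlen else run
    # "-+" matches one maximal run of gap characters.
    return re.sub("-+", convert, seq)

marked = mark_missing("AC--GT-----A", 2)
```
.fi
.IP
With a threshold of 2, the two\-column gap is kept as a microindel and the
five\-column gap is marked as missing data.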
.HP
\fB\-\-missing\-as\-indels\fR, \fB\-m\fR
.IP
Convert all missing data characters (Ns and *s) to gap
characters, except for Ns in a reference sequence specified by
\fB\-\-refidx\fR, which will be replaced by randomly selected
nucleotides.  (This allows the coordinate frame for the
reference sequence to be maintained; this option is only
recommended if such Ns are rare.)  If \fB\-\-refidx\fR is not
used, all Ns will be replaced by gaps.  You may want to use
\fB\-\-gap\-strip\fR ALL with this option.
.HP
\fB\-\-order\fR, \fB\-O\fR <name_list>
.IP
Change order of rows in alignment to match sequence names
specified in <name_list>.  If a name appears in <name_list> but
not in the alignment, a row of gaps will be inserted.  This
option is applied to the alignment *before* \fB\-\-seqs\fR,
\fB\-\-refidx\fR, and \fB\-\-gap\-strip\fR are applied.
.HP
\fB\-\-reverse\-complement\fR, \fB\-V\fR
.IP
Reverse complement output alignment.
.HP
\fB\-\-randomize\fR, \fB\-R\fR
.IP
Randomly permute the columns of the source alignment (done
*before* taking sub\-alignment).  Requires an ordered
representation of the alignment (be careful when using with
\fB\-\-in\-format\fR SS|MAF: the full alignment will be created from
sufficient statistics).
.HP
\fB\-\-fill\-Ns\fR, \fB\-N\fR <s:b\-e>
.IP
Fill sequence no. <s> with Ns, from <b> to <e>. Applied before
\fB\-\-start\fR, \fB\-\-end\fR, \fB\-\-seqs\fR, \fB\-\-gap\-strip\fR, but after \fB\-\-order\fR.
Coordinate frame depends on \fB\-\-refidx\fR.  Can be used
multiple times.
.HP
\fB\-\-summary\-only\fR, \fB\-S\fR
.IP
Report only summary statistics, rather than complete
alignment.  Statistics are for the alignment that would otherwise
be output (i.e., after other options have been applied).
.HP
\fB\-\-window\-summary\fR, \fB\-w\fR <win_size>
.IP
Like \fB\-S\fR, but output summary statistics for non\-overlapping
windows of the specified size.
.SS Sufficient statistics
.HP
\fB\-\-tuple\-size\fR, \fB\-T\fR <tup_size>
.IP
(For use with \fB\-\-out\-format\fR SS)  Represent an alignment in
terms of tuples of columns of the designated size.  Useful
with context\-dependent phylogenetic models.
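.IP
To make "tuples of columns" concrete, the Python sketch below counts
windows of adjacent columns in a toy alignment.  It is not msa_view code:
the name tuple_counts is hypothetical, and the sketch simply skips the
leading positions that lack a full tuple, which may differ from how
msa_view treats them.
.IP
.nf
```python
from collections import Counter

def tuple_counts(rows, tup_size):
    """Count distinct tuples of tup_size adjacent alignment columns."""
    if tup_size < 1:
        raise ValueError("tup_size must be at least 1")
    cols = ["".join(col) for col in zip(*rows)]
    counts = Counter()
    # Slide a window of tup_size columns along the alignment.
    for i in range(len(cols) - tup_size + 1):
        counts[tuple(cols[i:i + tup_size])] += 1
    return counts

pairs = tuple_counts(["ACAC", "ACAC"], 2)
```
.fi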
.HP
\fB\-\-unordered\-ss\fR, \fB\-z\fR
.IP
(For use with \fB\-\-out\-format\fR SS)  Suppress the portion of the
sufficient statistics concerned with the order in which
columns appear.  Useful for analyses for which order is
unimportant.
.SS MAF input
.HP
\fB\-\-refseq\fR, \fB\-M\fR <fname>
.IP
Read the complete text of the reference sequence from
<fname> (FASTA format) and combine it with the contents of
the MAF file to produce a complete, ordered representation of
the alignment (unaligned regions will be represented by gaps).
Best used with \fB\-\-out\-format\fR SS.  The reference sequence of the
MAF file is assumed to be the one that appears first in each
block.
.HP
\fB\-\-keep\-overlapping\fR, \fB\-k\fR
.IP
Keep blocks in MAF that have overlapping coordinates in the
reference (1st) sequence (by default, only the first one is
kept).  Useful in extracting unordered stats from a jumbled
collection of MAF blocks (e.g., output of Jim Kent's mafFrags
program).
Cannot be used with \fB\-\-refseq\fR, \fB\-\-features\fR, or
\fB\-\-cats\-cycle\fR.
.SS Site categories (all options require \-\-out\-format SS)
.HP
\fB\-\-features\fR, \fB\-g\fR <gff_fname>
.IP
(Requires \fB\-\-catmap\fR) Read sequence annotations from the
specified file (GFF) and label the columns of the alignment
accordingly.  Note: UCSC BED and genepred formats are now
recognized as well.
.HP
\fB\-\-catmap\fR, \fB\-c\fR <fname>|<string>
.IP
(optionally use with \fB\-\-features\fR) Mapping of feature types to
category numbers.  Can either give a filename or an "inline"
description of a simple category map, e.g., \fB\-\-catmap\fR "NCATS =
3 ; CDS 1\-3" or \fB\-\-catmap\fR "NCATS = 1 ; UTR 1".
.HP
\fB\-\-cats\-cycle\fR, \fB\-Y\fR <cycle_size>
.IP
(Alternative to \fB\-\-features\fR and \fB\-\-catmap\fR) Assign site categories in
cycles of the specified size, e.g., as 1,2,3,...,1,2,3 (for
cycle_size == 3).  Useful for in\-frame coding sequence, or to
partition a data set into nonoverlapping tuples of columns
(use with \fB\-\-do\-cats\fR).
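.IP
The cyclic labelling, and the effect of then selecting a single category,
can be sketched in Python as follows.  This is an illustration only: the
function names are hypothetical, and the cycle is assumed to start with
category 1 at the first column.
.IP
.nf
```python
def cycle_categories(ncols, cycle_size):
    """Return a 1-based category label for each of ncols columns."""
    if cycle_size < 1:
        raise ValueError("cycle_size must be at least 1")
    return [(i % cycle_size) + 1 for i in range(ncols)]

def columns_in_category(rows, cycle_size, cat):
    """Keep only the columns whose cyclic category equals cat."""
    ncols = len(rows[0]) if rows else 0
    cats = cycle_categories(ncols, cycle_size)
    return ["".join(c for c, k in zip(row, cats) if k == cat)
            for row in rows]

cats = cycle_categories(7, 3)
third = columns_in_category(["ATGGCCA", "ATGGCTA"], 3, 3)
```
.fi
.IP
For an in\-frame coding alignment, category 3 corresponds to third codon
positions.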
.HP
\fB\-\-do\-cats\fR, \fB\-C\fR <cat_list>
.IP
(For use with \fB\-\-features\fR or \fB\-\-cats\-cycle\fR)  Obtain
sufficient statistics only for the specified categories
(comma\-delimited list, by number).
.HP
\fB\-\-codons\fR, \fB\-D\fR
.IP
Extract sufficient statistics for in\-frame codons.  Implies
\fB\-\-tuple\-size\fR 3 \fB\-\-cats\-cycle\fR 3 \fB\-\-do\-cats\fR 3.
Not appropriate for use with \fB\-\-features\fR/\fB\-\-catmap\fR.
.HP
\fB\-\-reverse\-groups\fR, \fB\-W\fR <tag>
.IP
(For use with \fB\-\-features\fR) Group features by <tag> (e.g.,
"transcript_id" or "exon_id") and reverse complement
segments of the alignment corresponding to groups on the
reverse strand.  Groups must be non\-overlapping (see refeature
\fB\-\-unique\fR).  Useful when extracting sufficient statistics for
strand\-specific site categories (e.g., codon positions).
.HP
\fB\-\-4d\fR, \fB\-4\fR
.IP
(For use with \fB\-\-features\fR; assumes coding regions have feature
type 'CDS')  Extract sufficient statistics for fourfold
degenerate synonymous sites.  Implies \fB\-\-out\-format\fR SS
\fB\-\-unordered\-ss\fR \fB\-\-tuple\-size\fR 3 \fB\-\-reverse\-groups\fR transcript_id.
.SS Alignment cleaning
.HP
\fB\-\-clean\-coding\fR, \fB\-L\fR <seqname>
.IP
Clean an alignment of coding sequences with respect to a named
reference sequence.  Removes sites with gaps and blocks of
gapless sites smaller than 10 codons in length, ensures
everything is in\-frame wrt reference sequence, prohibits
in\-frame stop codons.  Reference sequence must begin with a
start codon and end with a stop codon.
.HP
\fB\-\-clean\-indels\fR, \fB\-I\fR <nseqs>
.IP
Clean an alignment with special attention to indels.  Sites
with fewer than <nseqs> bases are removed; bases adjacent to
indels, and short gapless subsequences, are replaced with Ns.
If used with \fB\-\-tuple\-size\fR, then <tup_size>\-1 columns of Ns
will be retained between columns not adjacent in the original
alignment.  Frame is not considered.
.SS Other
.HP
\fB\-\-help\fR, \fB\-h\fR
.IP
Print this help message.