.\" Man page generated from reStructuredText.
.
.TH "AXE" "1" "Feb 14, 2021" "0.3.3" "axe"
.SH NAME
axe \- axe Documentation
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.sp
Axe is a read de\-multiplexer, useful in situations where sequence reads contain
the indexes that uniquely distinguish samples. Axe uses a rapid and accurate
algorithm based on hamming mismatch tries to competitively match the prefix of
a sequencing read against a set of indexes. Axe supports combinatorial
indexing schemes.
.SH AXE TUTORIAL
.sp
In this tutorial, we\(aqll use Axe to demultiplex some paired\-end,
combinatorially\-indexed Genotyping\-by\-Sequencing reads. The data for this
tutorial is available from figshare:
\fI\%https://figshare.com/articles/axe\-tutorial_tar/6143720\fP .
.sp
Axe should be run as the initial step of any analysis: don\(aqt use sequence QC
tools like AdapterRemoval or Trimmomatic before using axe, as indexes may be
trimmed away, or pairing information removed.
.SS Step 0: Download the trial data
.sp
This will download the trial data, and extract it on the fly:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
curl \-LS https://ndownloader.figshare.com/files/11094782 | tar xv
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Step 1: Prepare a key file
.sp
The key file associates index sequences with sample names. A key file can be
prepared in a spreadsheet editor, like LibreOffice Calc, or Excel. The format
is quite strict, and is described in detail in the online usage documentation.
.sp
Let\(aqs now inspect the key file I have provided for the tutorial.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
head axe\-keyfile.tsv
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Step 2: Demultiplex with Axe
.sp
In this step, we will demultiplex our interleaved input file to per\-sample
interleaved output files. To see the full range of Axe\(aqs options, please run
\fBaxe\-demux \-h\fP, or inspect the online usage documentation.
.sp
First, let\(aqs inspect the input.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
zcat axe\-tutorial.fastq.gz | head \-n 8
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Then, we need to ensure that axe has somewhere to put the demultiplexed reads.
Axe outputs one file (or more, depending on pairing) per sample. Axe does so by
appending the sample name to some prefix (as given by the \fB\-I\fP, \fB\-F\fP,
and/or \fB\-R\fP options). If this prefix is a directory, then sample fastq files
will be created in that sub\-directory, but the directory must exist. Let\(aqs make
an output directory:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
mkdir \-p output
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Now, let\(aqs demultiplex the reads!
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
axe\-demux \-i axe\-tutorial.fastq.gz \-I output/ \e
   \-c \-b axe\-keyfile.tsv \-t demux\-stats.tsv \-z 1
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The command above demultiplexes reads from \fBaxe\-tutorial.fastq.gz\fP into
separate files under \fBoutput\fP, based on the combinatorial (\fB\-c\fP)
sample\-to\-index\-sequence mapping described in \fBaxe\-keyfile.tsv\fP, and saves a
file of statistics as \fBdemux\-stats.tsv\fP\&. Note that we have enabled
compression of output files using the \fB\-z\fP option, in case you don\(aqt have
much disk space available. This will make Axe slightly slower.
.SH AXE USAGE
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
For arcane reasons, the name of the \fBaxe\fP binary changed to \fBaxe\-demux\fP
with version 0.3.0. Apologies for the inconvenience; this was required to
make \fBaxe\fP installable in Debian and its derivatives. Command\-line usage
did not change.
.UNINDENT
.UNINDENT
.sp
Axe has several usage modes. The primary distinction is between the two
alternate indexing schemes, single and combinatorial indexing. Single index
matching is used when only the first read contains index sequences.
Combinatorial indexing is used when both reads in a read pair contain
independent (typically different) index sequences.
.sp
For concise reference, the command\-line usage of \fBaxe\-demux\fP is reproduced
below:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
USAGE:
axe\-demux [\-mzc2pt] \-b (\-f [\-r] | \-i) (\-F [\-R] | \-I)
axe\-demux \-h
axe\-demux \-V

OPTIONS:
    \-m, \-\-mismatch	Maximum hamming distance mismatch. [int, default 1]
    \-z, \-\-ziplevel	Gzip compression level, or 0 for plain text [int, default 0]
    \-c, \-\-combinatorial	Use combinatorial barcode matching. [flag, default OFF]
    \-p, \-\-permissive	Don\(aqt error on barcode mismatch conflict, matching only
                    	exactly for conflicting barcodes. [flag, default OFF]
    \-2, \-\-trim\-r2	Trim barcode from R2 read as well as R1. [flag, default OFF]
    \-b, \-\-barcodes	Barcode file. See \-\-help for example. [file]
    \-f, \-\-fwd\-in	Input forward read. [file]
    \-F, \-\-fwd\-out	Output forward read prefix. [file]
    \-r, \-\-rev\-in	Input reverse read. [file]
    \-R, \-\-rev\-out	Output reverse read prefix. [file]
    \-i, \-\-ilfq\-in	Input interleaved paired reads. [file]
    \-I, \-\-ilfq\-out	Output interleaved paired reads prefix. [file]
    \-t, \-\-table\-file	Output a summary table of demultiplexing statistics to file. [file]
    \-h, \-\-help		Print this usage plus additional help.
    \-V, \-\-version	Print version string.
    \-v, \-\-verbose	Be more verbose. Additive, \-vv is more verbose than \-v.
    \-q, \-\-quiet		Be very quiet.

.ft P
.fi
.UNINDENT
.UNINDENT
.SS Inputs and Outputs
.sp
Regardless of read mode, three input and output schemes are supported:
single\-end reads, paired reads (separate R1 and R2 files) and interleaved
paired reads (one file, with R1 and R2 as consecutive reads). If single\-end
reads are supplied as input, they must be output as single\-end reads. If either paired or
interleaved paired reads are read, they can be output as either paired reads or
interleaved paired reads. This applies to both successfully de\-multiplexed reads
and reads that could not be de\-multiplexed.
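.sp
As an illustration of the interleaved layout, the following Python sketch
iterates over read pairs in an interleaved FASTQ stream. It is not part of
Axe: the function names \fBread_fastq\fP and \fBread_interleaved\fP are
hypothetical, and the sketch assumes plain four\-line FASTQ records.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, quality) from four line FASTQ records."""
    while True:
        record = [line.rstrip() for line in islice(handle, 4)]
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        yield record[0], record[1], record[3]


def read_interleaved(handle):
    """Yield (R1, R2) tuples, taking consecutive records as a pair."""
    records = read_fastq(handle)
    for fwd in records:
        rev = next(records, None)
        if rev is None:
            raise ValueError("odd number of reads in interleaved input")
        yield fwd, rev
```
.ft P
.fi
.UNINDENT
.UNINDENT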
.sp
The \fB\-z\fP flag can be used to specify that outputs should be compressed using
gzip compression. The \fB\-z\fP flag takes an integer argument between 0 (the
default) and 9, where 0 indicates plain text output (\fBgzopen\fP mode "wT"), and
1\-9 indicate that the respective compression level should be used, where 1 is
fastest and 9 is most compact.
.sp
The output flags should be prefixes that are used to generate the output file
name based on the index\(aqs (or index pair\(aqs) ID. The names are generated as:
\fBprefix\fP + \fB_\fP + \fBindex ID\fP + \fB_\fP + \fBread number\fP + \fB\&.extension\fP\&.
The output file for reads that could not be demultiplexed is \fBprefix\fP + \fB_\fP
+ \fBunknown\fP + \fB_\fP + \fBread number\fP + \fB\&.extension\fP\&.  The read number is
omitted unless the paired read file scheme is used, and is "il" for interleaved
output. The extension is "fastq"; ".gz" is appended to the extension if the
\fB\-z\fP flag is used.
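.sp
The naming rule above can be sketched in Python as follows. This is an
illustration only, not Axe\(aqs own code: the function name
\fBoutput_name\fP, the scheme names, and the \fBR1\fP/\fBR2\fP labels used
as the read number for paired output are assumptions made for the example, as
is dropping the separator together with the read number for single\-end output.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def output_name(prefix, index_id, scheme, ziplevel=0):
    """Build an output file name from a prefix and an index ID.

    scheme is "single", "paired_r1", "paired_r2" or "interleaved".
    """
    labels = {
        "single": None,       # read number omitted
        "paired_r1": "R1",    # assumed label
        "paired_r2": "R2",    # assumed label
        "interleaved": "il",  # as stated in the text
    }
    parts = [prefix, index_id]
    label = labels[scheme]
    if label is not None:
        parts.append(label)
    extension = "fastq.gz" if ziplevel > 0 else "fastq"
    return "_".join(parts) + "." + extension
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Under these assumptions, a prefix of \fBoutput/\fP, a hypothetical index ID of
\fBsampleA\fP, interleaved output and \fB\-z 1\fP give
\fBoutput/_sampleA_il.fastq.gz\fP\&.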
.INDENT 0.0
.TP
.B The corresponding CLI flags are:
.INDENT 7.0
.IP \(bu 2
\fB\-f\fP and \fB\-F\fP: Single end or paired R1 file input and output
respectively.
.IP \(bu 2
\fB\-r\fP and \fB\-R\fP: Paired R2 file input and output.
.IP \(bu 2
\fB\-i\fP and \fB\-I\fP: Interleaved paired input and output.
.UNINDENT
.UNINDENT
.SS The index file
.sp
The index file is a tab\-separated file with an optional header. It is
mandatory, and is always supplied using the \fB\-b\fP command line flag. The exact
format is dependent on indexing mode, and is described further in the sections
below. If a header is present, the header line must start with either
\fBBarcode\fP or \fBindex\fP, or it will be interpreted as an index line, leading
to a parsing error. Any line starting with \(aq;\(aq or \(aq#\(aq is ignored, allowing
comments to be added in line with indexes. Please ensure that the software
used to produce the index file uses ASCII encoding and does not insert a
byte\-order mark (BOM), as many text editors silently use Unicode\-based
encoding schemes. I recommend the use of
\fI\%LibreOffice Calc\fP (part of a free and open source
office suite) to generate index tables; Microsoft Excel can also be used.
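.sp
The rules above can be sketched as a small Python parser. This is a
simplified illustration rather than Axe\(aqs actual parser: the function name
\fBparse_index_lines\fP is hypothetical, the header check is done
case\-insensitively (an assumption), blank lines are skipped (also an
assumption), and the column layouts are those given for each index mode in
the sections below.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
TAB = chr(9)  # the tab character, spelled without an escape sequence


def parse_index_lines(lines, combinatorial=False):
    """Return a list of (barcodes, sample_id) tuples from index file lines."""
    n_barcodes = 2 if combinatorial else 1
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip()
        if not line or line[0] in ";#":
            continue  # blank lines (assumed harmless) and comments
        if line.lower().startswith(("barcode", "index")):
            continue  # header line
        fields = line.split(TAB)
        if len(fields) != n_barcodes + 1:
            raise ValueError(
                "line %d: expected %d columns" % (number, n_barcodes + 1))
        records.append((tuple(fields[:n_barcodes]), fields[n_barcodes]))
    return records
```
.ft P
.fi
.UNINDENT
.UNINDENT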
.SS Mismatch level selection
.sp
Independent of index mode, the \fB\-m\fP flag is used to select the maximum
allowable hamming distance between a read\(aqs prefix and an index for the two to be
considered a match. As "mutated" indexes must be unique, a hamming distance
of one is the default, as typically indexes are designed to differ by a hamming
distance of at least two. Optionally (using the \fB\-p\fP flag), axe will allow
selective mismatch levels: if the mutated forms of two indexes clash, those
indexes are matched only exactly. This allows one to process datasets with indexes that
don\(aqt have a sufficiently high distance between them.
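.sp
To make the notion of a clash concrete, the Python sketch below enumerates the
"mutated" forms of each index at a given hamming distance and reports any
mutated sequence shared by more than one index. It is a simplified model, not
Axe\(aqs implementation: the names \fBmutate\fP and \fBfind_clashes\fP are
hypothetical, the alphabet is assumed to be \fBACGT\fP plus \fBN\fP, and only
clashes within a single mismatch level are considered.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
from itertools import combinations, product

ALPHABET = "ACGTN"  # assumed alphabet


def mutate(index, distance):
    """Return every sequence with exactly `distance` mismatches to `index`."""
    mutated = set()
    for positions in combinations(range(len(index)), distance):
        # Letters that differ from the original at each chosen position.
        choices = [[c for c in ALPHABET if c != index[p]] for p in positions]
        for letters in product(*choices):
            seq = list(index)
            for p, c in zip(positions, letters):
                seq[p] = c
            mutated.add("".join(seq))
    return mutated


def find_clashes(indexes, distance):
    """Map each mutated sequence shared by several indexes to those indexes."""
    owners = {}
    for index in indexes:
        for seq in mutate(index, distance):
            owners.setdefault(seq, set()).add(index)
    return {seq: idx for seq, idx in owners.items() if len(idx) > 1}
```
.ft P
.fi
.UNINDENT
.UNINDENT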
.SS Single index mode
.sp
Single index mode is the default mode of operation. Barcodes are matched
against read one (hereafter the forward read), and the index is trimmed from
only the forward read, unless the \fB\-2\fP command line flag is given, in which
case a prefix the same length as the matched index is also trimmed from the
second or reverse read. Note that the sequence of this second read is not checked
before trimming.
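.sp
The trimming behaviour described above can be sketched in Python as follows.
The function name \fBtrim_single_index\fP is hypothetical, and trimming the
quality string alongside the sequence is an assumption of this sketch.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def trim_single_index(fwd, rev, index_length, trim_r2=False):
    """Trim a matched index from a read pair.

    fwd and rev are (sequence, quality) tuples. The reverse read is only
    trimmed when trim_r2 is true, and its sequence is never inspected.
    """
    fwd_seq, fwd_qual = fwd
    trimmed_fwd = (fwd_seq[index_length:], fwd_qual[index_length:])
    if trim_r2:
        rev_seq, rev_qual = rev
        rev = (rev_seq[index_length:], rev_qual[index_length:])
    return trimmed_fwd, rev
```
.ft P
.fi
.UNINDENT
.UNINDENT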
.sp
In single index mode, the index file has two columns: \fBBarcode\fP and
\fBID\fP\&.
.SS Combinatorial index mode
.sp
Combinatorial index mode is activated by giving the \fB\-c\fP flag on the
command line. Forward read indexes are matched against the forward read, and
reverse read indexes are matched against the reverse read. The optimal
indexes are selected independently, and the index pair is selected from
these two indexes. The respective indexes are trimmed from both reads; the
\fB\-2\fP command line flag has no effect in combinatorial index mode.
.sp
In combinatorial index mode, the index file has three columns:
\fBBarcode1\fP, \fBBarcode2\fP and \fBID\fP\&. Individual indexes can occur many times
within the forward and reverse indexes, but index pairs must be unique
combinations.
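.sp
As a rough Python sketch of how an index pair might be resolved to a sample:
the forward and reverse indexes are matched independently, and the pair is
then looked up in a table built from the three\-column index file. The names
\fBbuild_pair_table\fP and \fBassign_pair\fP are hypothetical, and returning
\fBNone\fP for unmatched or unlisted pairs is an assumption of the sketch.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def build_pair_table(records):
    """Map (Barcode1, Barcode2) to ID, rejecting duplicate pairs.

    Individual barcodes may appear in many records; pairs may not.
    """
    table = {}
    for barcode1, barcode2, sample in records:
        pair = (barcode1, barcode2)
        if pair in table:
            raise ValueError("duplicate index pair: %s/%s" % pair)
        table[pair] = sample
    return table


def assign_pair(fwd_index, rev_index, table):
    """Return the sample for a pair of independently matched indexes.

    Either index may be None (no match); unlisted pairs also give None.
    """
    if fwd_index is None or rev_index is None:
        return None
    return table.get((fwd_index, rev_index))
```
.ft P
.fi
.UNINDENT
.UNINDENT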
.SS The Demultiplexing Statistics File
.sp
The \fB\-t\fP option allows the output of per\-sample read counts to a
tab\-separated file. The file will have a header describing its format, and
includes a line for reads which could not be demultiplexed.
.SH AXE'S MATCHING ALGORITHM
.sp
Axe uses an algorithm based on longest\-prefix\-in\-trie matching to match a
variable length from the start of each read against a set of \(aqmutated\(aq
indexes.
.SS Hamming distance matching
.sp
While for most applications in high\-throughput sequencing hamming distances are
a frowned\-upon metric, it is typical for HTS read indexes to be designed to
tolerate a certain level of hamming mismatches. Given these sequences are short
and typically occur at the 5\(aq end of reads, insertions and deletions rarely
need be considered, and the increased rate of assignment of reads with many
errors is offset by the risk of falsely assigning indexes to an incorrect
sample. In any case, reads with more than 1\-2 sequencing errors in their first
several bases are likely to be poor quality, and will simply be filtered out
during downstream quality control.
.SS Hamming mismatch tries
.sp
Typically, reads are matched to a set of indexes by calculating the hamming
distance between each index of length l and the first l bases of the
read. The "correct" index is then selected by
recording either the index with the lowest hamming distance to the read
(competitive matching) or by simply accepting the first index with a hamming
distance below a certain threshold.  These approaches are both very
computationally expensive, and can have lower accuracy than the algorithm I
propose. Additionally, implementations of these methods rarely handle indexes
of differing length and combinatorial indexing well, if at all.
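.sp
For comparison, the conventional competitive approach described above might
look like the following Python sketch, which compares every read against
every index. The names \fBhamming\fP and \fBcompetitive_match\fP are
hypothetical, and treating a tie between two equally close indexes as "no
match" is a design choice of this sketch rather than a description of any
particular tool.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def hamming(a, b):
    """Number of mismatching positions between equal length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def competitive_match(read, indexes, max_mismatch):
    """Return the index closest to the start of the read, or None.

    None is returned when no index is within max_mismatch, or when the
    two best indexes are equally close (an ambiguous match).
    """
    scored = []
    for index in indexes:
        prefix = read[:len(index)]
        if len(prefix) == len(index):  # skip reads shorter than the index
            scored.append((hamming(index, prefix), index))
    scored.sort()
    if not scored or scored[0][0] > max_mismatch:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None
    return scored[0][1]
```
.ft P
.fi
.UNINDENT
.UNINDENT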
.sp
Central to Axe\(aqs algorithm is the concept of hamming\-mismatch tries. A trie is
an N\-ary tree for an N letter alphabet. In the case of high\-throughput
sequencing reads, we have the alphabet \fBAGCT\fP, corresponding to the four
nucleotides of DNA, plus \fBN\fP, used to represent ambiguous base calls. Instead
of matching each index to each read, we pre\-calculate all allowable sequences
at each mismatch level, and store these in level\-wise tries. For example, to
match to a hamming distance of 2, we create three tries: one containing all
indexes verbatim, and two containing every sequence within a hamming
distance of 1 and 2 of each index, respectively. Hereafter, these tries are
referred to as the 0\-mm, 1\-mm and 2\-mm tries, for a hamming distance (mismatch) of
0, 1 and 2. Then, we find the longest prefix of each sequence read in the 0\-mm
trie. If no prefix of the read is a valid leaf in the 0\-mm trie, we find the longest
prefix in the 1\-mm trie, and so on for all tries in ascending order. If no
prefix of the read is a complete sequence in any trie, the read is assigned to
a "non\-indexed" output file.
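.sp
The procedure above can be sketched in Python using nested dictionaries as
tries. This is a minimal model for illustration, not Axe\(aqs implementation:
all function names are hypothetical, each level\-wise trie here holds the
sequences at exactly that number of mismatches, and clashes between mutated
indexes are not handled.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
from itertools import combinations, product

ALPHABET = "ACGTN"


def mutate(index, distance):
    """Return every sequence with exactly `distance` mismatches to `index`."""
    mutated = set()
    for positions in combinations(range(len(index)), distance):
        choices = [[c for c in ALPHABET if c != index[p]] for p in positions]
        for letters in product(*choices):
            seq = list(index)
            for p, c in zip(positions, letters):
                seq[p] = c
            mutated.add("".join(seq))
    return mutated


def build_trie(sequences):
    """Build a nested dict trie; the key None marks a complete sequence."""
    root = {}
    for seq, sample in sequences.items():
        node = root
        for base in seq:
            node = node.setdefault(base, {})
        node[None] = sample
    return root


def longest_prefix(trie, read):
    """Return (sample, length) for the longest complete sequence that is
    a prefix of the read, or None if there is no such sequence."""
    node, best = trie, None
    for depth, base in enumerate(read, start=1):
        node = node.get(base)
        if node is None:
            break
        if None in node:
            best = (node[None], depth)
    return best


def build_tries(index_to_sample, max_mismatch):
    """Build one trie per mismatch level, from 0 up to max_mismatch."""
    tries = []
    for distance in range(max_mismatch + 1):
        level = {}
        for index, sample in index_to_sample.items():
            for seq in mutate(index, distance):
                level[seq] = sample  # clashes are not handled here
        tries.append(build_trie(level))
    return tries


def assign(read, tries):
    """Search the tries in ascending mismatch order.

    Returns (sample, index_length, mismatches), or None if unassigned.
    """
    for mismatches, trie in enumerate(tries):
        hit = longest_prefix(trie, read)
        if hit is not None:
            return hit[0], hit[1], mismatches
    return None
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Because the 0\-mm trie is searched first, a read beginning with a short index
verbatim is assigned to that index even if a longer index would match with one
mismatch.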
.sp
This algorithm ensures optimal index matching in many ways, but is also
extremely fast. In situations with indexes of differing length, we ensure that
the \fIlongest\fP acceptable index at a given hamming distance is chosen;
assuming that sequence is random after the index, the probability of false
assignments using this method is low. We also ensure that short perfect matches
are preferred to longer inexact matches, as we firstly only consider indexes
with no error, then 1 error, and so on. This ensures that reads with indexes
that are followed by random sequence that happens to inexactly match a longer
index in the set are not falsely assigned to this longer index.
.sp
The speed of this algorithm is largely due to the fact that matching takes
constant time with respect to the number of indexes. The time taken to
match each read is proportional instead to the length of the indexes, as for an
index of length l, at most l + 1 trie level descents are
required to find an entry in the trie. As this length is more\-or\-less constant
and small, the overall complexity of axe\(aqs algorithm is O(n) for
n reads, as opposed to O(nm) for n reads and m
indexes, as is typical for traditional matching algorithms.
.SH AUTHOR
Kevin Murray
.SH COPYRIGHT
2021, Kevin Murray
.\" Generated by docutils manpage writer.
.