.\" Man page generated from reStructuredText.
.
.TH "PAIRTOOLS" "1" "Dec 07, 2020" "0.3.0" "pairtools"
.SH NAME
pairtools \- pairtools Documentation
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.sp
\fIpairtools\fP is a simple and fast command\-line framework to process sequencing
data from a Hi\-C experiment. \fIpairtools\fP perform various operations on Hi\-C
pairs and occupy the middle position in a typical Hi\-C data processing
pipeline:
.INDENT 0.0
.INDENT 2.5
[image: The diagram of a typical processing pipeline for Hi\-C data]
.sp
In a typical Hi\-C pipeline, DNA sequences (reads) are aligned to the
reference genome, converted into ligation junctions and binned, thus
producing a Hi\-C contact map.
.UNINDENT
.UNINDENT
.sp
\fIpairtools\fP aim to be an all\-in\-one tool for processing Hi\-C pairs, and
can perform the following operations:
.INDENT 0.0
.IP \(bu 2
detect ligation junctions (a.k.a. Hi\-C pairs) in aligned paired\-end sequences of Hi\-C DNA molecules
.IP \(bu 2
sort .pairs files for downstream analyses
.IP \(bu 2
detect, tag and remove PCR/optical duplicates
.IP \(bu 2
generate extensive statistics of Hi\-C datasets
.IP \(bu 2
select Hi\-C pairs given flexibly defined criteria
.IP \(bu 2
restore .sam alignments from Hi\-C pairs
.UNINDENT
.sp
\fIpairtools\fP produce .pairs files compliant with the
\fI\%4DN standard\fP\&.
.sp
The full list of available pairtools:
.TS
center;
|l|l|.
_
T{
Pairtool
T}	T{
Description
T}
_
T{
dedup
T}	T{
Find and remove PCR/optical duplicates.
T}
_
T{
filterbycov
T}	T{
Remove pairs from regions of high coverage.
T}
_
T{
flip
T}	T{
Flip pairs to get an upper\-triangular matrix.
T}
_
T{
markasdup
T}	T{
Tag pairs as duplicates.
T}
_
T{
merge
T}	T{
Merge sorted .pairs/.pairsam files.
T}
_
T{
parse
T}	T{
Find ligation junctions in .sam, make .pairs.
T}
_
T{
phase
T}	T{
Phase pairs mapped to a diploid genome.
T}
_
T{
restrict
T}	T{
Assign restriction fragments to pairs.
T}
_
T{
select
T}	T{
Select pairs according to some condition.
T}
_
T{
sort
T}	T{
Sort a .pairs/.pairsam file.
T}
_
T{
split
T}	T{
Split a .pairsam file into .pairs and .sam.
T}
_
T{
stats
T}	T{
Calculate pairs statistics.
T}
_
.TE
.SH QUICKSTART
.sp
Install \fIpairtools\fP and all of its dependencies using the
\fI\%conda\fP package manager and
the \fI\%bioconda\fP channel for bioinformatics
software.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ conda install \-c conda\-forge \-c bioconda pairtools
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Set up a new test folder and download a small Hi\-C dataset mapped to the sacCer3 genome:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ mkdir /tmp/test\-pairtools
$ cd /tmp/test\-pairtools
$ wget https://github.com/mirnylab/distiller\-test\-data/raw/master/bam/MATalpha_R1.bam
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Additionally, we will need a .chromsizes file, a TAB\-separated plain text table describing the names, sizes and the order of chromosomes in the genome assembly used during mapping:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ wget https://raw.githubusercontent.com/mirnylab/distiller\-test\-data/master/genome/sacCer3.reduced.chrom.sizes
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
With \fIpairtools parse\fP, we can convert paired\-end sequence alignments stored in .sam/.bam format into .pairs, a TAB\-separated table of Hi\-C ligation junctions:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ pairtools parse \-c sacCer3.reduced.chrom.sizes \-o MATalpha_R1.pairs.gz \-\-drop\-sam MATalpha_R1.bam
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Inspect the resulting table:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ less MATalpha_R1.pairs.gz
.ft P
.fi
.UNINDENT
.UNINDENT
.SH INSTALLATION
.SS Requirements
.INDENT 0.0
.IP \(bu 2
Python 3.x
.IP \(bu 2
Python packages \fInumpy\fP and \fIclick\fP
.IP \(bu 2
Command\-line utilities \fIsort\fP (the Unix version), \fIbgzip\fP (shipped with \fItabix\fP)
and \fIsamtools\fP\&. If available, \fIpairtools\fP can compress outputs with \fIpbgzip\fP and \fIlz4\fP\&.
.UNINDENT
.SS Install using conda
.sp
We highly recommend using the \fIconda\fP package manager to install pre\-compiled
\fIpairtools\fP together with all its dependencies. To get it, you can either
install the full \fI\%Anaconda\fP Python
distribution or just the standalone
\fI\%conda\fP package manager.
.sp
With \fIconda\fP, you can install pre\-compiled \fIpairtools\fP and all of its
dependencies from the \fI\%bioconda\fP channel:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ conda install \-c conda\-forge \-c bioconda pairtools
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Install using pip
.sp
Alternatively, compile and install \fIpairtools\fP and its Python dependencies from
PyPI using pip:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ pip install pairtools
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Install the development version
.sp
Finally, you can install the latest development version of \fIpairtools\fP from
GitHub. First, make a local clone of the GitHub repository:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ git clone https://github.com/mirnylab/pairtools
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Then, you can compile and install \fIpairtools\fP in
\fI\%the development mode\fP,
which installs the package without moving it to a system folder and thus allows
immediate live\-testing of any changes in the Python code. Please make sure that you
have \fIcython\fP installed!
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ cd pairtools
$ pip install \-e ./
.ft P
.fi
.UNINDENT
.UNINDENT
.SH PARSING SEQUENCE ALIGNMENTS INTO HI-C PAIRS
.SS Overview
.sp
Hi\-C experiments aim to measure the frequencies of contacts between all pairs
of loci in the genome. In these experiments, the spatial structure of chromosomes
is first fixed with formaldehyde crosslinks, after which DNA is partially
digested with restriction enzymes and then re\-ligated. Then, DNA is
shredded into smaller pieces, released from the nucleus, sequenced and aligned to
the reference genome. The resulting sequence alignments reveal if DNA molecules
were formed through ligations between DNA from different locations in the genome.
These ligation events imply that ligated loci were close to each other
when the ligation enzyme was active, i.e. they formed "a contact".
.sp
\fBpairtools parse\fP detects ligation events in the aligned sequences of
DNA molecules formed in Hi\-C experiments and reports them in the .pairs/.pairsam
format.
.SS Terminology
.sp
Throughout this document we will be using the same visual language to describe
how DNA sequences (in the .fastq format) are transformed into sequence alignments
(.sam/.bam) and into ligation events (.pairs).
.INDENT 0.0
.INDENT 2.5
[image: The visual language to describe transformation of Hi\-C data]
.sp
DNA sequences (reads) are aligned to the reference genome and converted into
ligation events.
.UNINDENT
.UNINDENT
.sp
Short\-read sequencing determines the sequences of both ends (or, \fBsides\fP)
of DNA molecules (typically 50\-300 bp), producing \fBread pairs\fP in .fastq format
(shown in the first row of the figure above).
In such reads, base pairs are reported from the tips inwards, which is also
defined as the \fB5\(aq\->3\(aq\fP direction (in accordance with the 5\(aq\->3\(aq direction of the
DNA strand that is sequenced on the corresponding side of the read pair).
.sp
Alignment software maps both reads of a pair to the reference genome, producing
\fBalignments\fP, i.e. segments of the reference genome with matching sequences.
Typically, there will be only two alignments per read pair, one on each side.
Sometimes, however, parts of one or both sides may map
to different locations on the genome, producing more than two alignments per
DNA molecule (see \fI\%Multiple ligations (walks)\fP).
.sp
\fBpairtools parse\fP converts alignments into \fBligation events\fP (aka
\fBHi\-C pairs\fP aka \fBpairs\fP). In the simplest case, when each side has only one
unique alignment (i.e. the whole side maps to a single unique segment of the
genome), for each side, we report the chromosome, the genomic position of the
outer\-most (5\(aq) aligned base pair and the strand of the reference genome that
the read aligns to.  \fBpairtools parse\fP assigns to such pairs the type \fBUU\fP
(unique\-unique).
.SS Unmapped/multimapped reads
.sp
Sometimes, one side or both sides of a read pair may not align to the
reference genome:
.INDENT 0.0
.INDENT 2.5
[image: Read pairs missing an alignment on one or both sides]
.UNINDENT
.UNINDENT
.sp
In this case, \fBpairtools parse\fP fills in the chromosome of the corresponding
side of the Hi\-C pair with \fB!\fP, the position with \fB0\fP and the strand with \fB\-\fP\&.
Such pairs are reported as type \fBNU\fP (null\-unique, when the other side has
a unique alignment) or \fBNN\fP (null\-null, when both sides lack any alignment).
.sp
Similarly, when one or both sides map to many genome locations equally well (i.e.
have non\-unique, or, multi\-mapping alignments), \fBpairtools parse\fP reports
the corresponding sides as (chromosome= \fB!\fP, position= \fB0\fP, strand= \fB\-\fP) and
type \fBMU\fP (multi\-unique) or \fBMM\fP (multi\-multi) or \fBNM\fP (null\-multi),
depending on the type of the alignment on the other side.
.INDENT 0.0
.INDENT 2.5
[image: Read pairs with a non\-unique alignment on one or both sides]
.sp
Read pairs with a non\-unique (multi\-) alignment on one side.
.UNINDENT
.UNINDENT
.sp
\fBpairtools parse\fP calls an alignment multi\-mapping when its
\fI\%MAPQ score\fP
(which depends on the scoring gap between the two best candidate alignments for a segment)
is lower than the value specified with the \fB\-\-min\-mapq\fP flag (by default, 1).
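.sp
The classification described in this section can be sketched in a few lines of
Python. This is an illustration rather than the \fIpairtools\fP implementation:
the function names are hypothetical, and the sketch assumes that a mapped
alignment counts as unique when its MAPQ is at least \fB\-\-min\-mapq\fP and as
multi\-mapping otherwise.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# Rank of each side type, from "poorer" to "better" alignment quality.
RANK = {"N": 0, "M": 1, "U": 2}


def side_type(is_mapped, mapq, min_mapq=1):
    """Classify one side as N (null), M (multi-mapping) or U (unique)."""
    if not is_mapped:
        return "N"
    # Assumption: a MAPQ below the threshold marks a multi-mapping alignment.
    return "U" if mapq >= min_mapq else "M"


def pair_type(type1, type2):
    """Combine two side types, writing the poorer side first."""
    poorer, better = sorted([type1, type2], key=RANK.get)
    return poorer + better


print(pair_type(side_type(True, 60), side_type(False, 0)))  # NU
print(pair_type(side_type(True, 0), side_type(True, 37)))   # MU
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Walks and rescued pairs are not covered by this sketch; they are discussed in
the following sections.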
.SS Multiple ligations (walks)
.sp
Finally, a read pair may contain more than two alignments:
.INDENT 0.0
.INDENT 2.5
[image: A sequenced Hi\-C molecule that was formed via multiple ligations]
.UNINDENT
.UNINDENT
.sp
Molecules like these typically form via multiple ligation events and we call them
walks [1]\&. Currently, \fBpairtools parse\fP does not process such molecules and
tags them as type \fBWW\fP\&.
.SS Interpreting gaps between alignments
.sp
Reads that are only partially aligned to the genome can be interpreted in
two different ways. One possibility is to assume that this molecule
was formed via at least two ligations (i.e. it\(aqs a \fIwalk\fP) but the non\-aligned
part (a \fBgap\fP) was missing from the reference genome for one reason or another.
Another possibility is to simply ignore this gap (for example, because it could
be an insertion or a technical artifact), thus assuming that our
molecule was formed via a single ligation and has to be reported:
.INDENT 0.0
.INDENT 2.5
[image: A gap between alignments can be ignored or interpreted as a "null" alignment]
.sp
A gap between alignments can be interpreted as a legitimate segment without
an alignment or simply ignored.
.UNINDENT
.UNINDENT
.sp
Both options have their merits, depending on the dataset and the quality of the
reference genome and sequencing. \fBpairtools parse\fP ignores shorter \fIgaps\fP and keeps
longer ones as "null" alignments. The maximal size of ignored \fIgaps\fP is set by
the \fB\-\-max\-inter\-align\-gap\fP flag (by default, 20bp).
.SS Rescuing single ligations
.sp
Importantly, some DNA molecules containing only one ligation junction
may still end up with three alignments:
.INDENT 0.0
.INDENT 2.5
[image: Not all read pairs with three alignments come from "walks"]
.UNINDENT
.UNINDENT
.sp
A molecule formed via a single ligation gets three alignments when one of the
two ligated DNA pieces is shorter than the read length, such that the read on
the corresponding side sequences through the ligation junction and into the other
piece [2]\&. The number of such molecules depends on the type of the restriction
enzyme, the typical size of DNA molecules in the Hi\-C library and the read
length, and sometimes can be considerable.
.sp
\fBpairtools parse\fP detects such molecules and \fBrescues\fP them (i.e.
changes their type from a \fIwalk\fP to a single\-ligation molecule). It tests
walks with three alignments using three criteria:
.INDENT 0.0
.INDENT 2.5
[image: The three criteria used for "rescue"]
.sp
The three criteria used to "rescue" three\-alignment walks: cis, point towards each other, short distance.
.UNINDENT
.UNINDENT
.INDENT 0.0
.IP 1. 3
On the side with two alignments (the \fBchimeric\fP side), the "inner" (or, 3\(aq)
alignment must be on the same chromosome as the alignment on the non\-chimeric
side.
.IP 2. 3
The "inner" alignment on the chimeric side and the alignment on the
non\-chimeric side must point toward each other.
.IP 3. 3
These two alignments must be within the distance specified with the
\fB\-\-max\-molecule\-size\fP flag (by default, 2000bp).
.UNINDENT
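.sp
The three criteria listed above could be expressed roughly as follows. This is
a simplified sketch, not the code used by \fBpairtools parse\fP: the dictionary
keys (chrom, strand, pos5) and the function name are hypothetical, and the
distance is measured here between the 5\-prime ends of the two alignments,
which is an assumption made for illustration.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def can_rescue(inner, other, max_molecule_size=2000):
    """Apply the three rescue criteria to two alignments.

    inner is the inner alignment on the chimeric side, other is the
    alignment on the non-chimeric side. Each is a dict with the keys
    chrom, strand and pos5 (reference position of the 5-prime end).
    """
    # 1. Both alignments must be on the same chromosome.
    if inner["chrom"] != other["chrom"]:
        return False
    # 2. They must point toward each other: opposite strands, with the
    #    plus-strand alignment located upstream of the minus-strand one.
    if inner["strand"] == other["strand"]:
        return False
    plus, minus = (inner, other) if inner["strand"] == "+" else (other, inner)
    if plus["pos5"] > minus["pos5"]:
        return False
    # 3. The distance between them must not exceed the size limit.
    return minus["pos5"] - plus["pos5"] <= max_molecule_size


a = {"chrom": "chr1", "strand": "+", "pos5": 1000}
b = {"chrom": "chr1", "strand": "-", "pos5": 1400}
print(can_rescue(a, b))  # True
```
.ft P
.fi
.UNINDENT
.UNINDENT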
.sp
Sometimes, the "inner" alignment on the chimeric side can be non\-unique or "null"
(i.e. when the unmapped segment is longer than \fB\-\-max\-inter\-align\-gap\fP,
as described in \fI\%Interpreting gaps between alignments\fP). \fBpairtools parse\fP ignores such alignments
altogether and thus rescues such \fIwalks\fP as well.
.INDENT 0.0
.INDENT 2.5
[image: A walk with three alignments gets rescued when the middle alignment is multi\- or null]
.UNINDENT
.UNINDENT
.IP [1] 5
Following the lead of \fI\%C\-walks\fP
.IP [2] 5
This procedure was first introduced in \fI\%HiC\-Pro\fP
and then in \fI\%Juicer\fP\&.
.SH SORTING PAIRS
.sp
In order to enable efficient random access to Hi\-C pairs, we \fBflip\fP and \fBsort\fP pairs.
After sorting, interactions become arranged in the order of their genomic position,
such that, for any given pair of regions, we can easily find and extract all of their interactions.
And, after flipping, all artificially duplicated molecules (either during PCR or
in optical sequencing) end up in adjacent rows in sorted lists of interactions,
such that we can easily identify and remove them.
.SS Sorting
.sp
\fBpairtools sort\fP arranges pairs in the order of (chrom1, chrom2, pos1, pos2).
This order is also known as \fIblock sorting\fP, because all pairs between
any given pair of chromosomes become grouped into one continuous block.
Additionally, \fBpairtools sort\fP sorts pairs with identical positions by
\fIpair_type\fP\&. This has little effect on mapped reads, but it cleanly splits
unmapped reads into blocks of null\-mapped and multi\-mapped reads.
.sp
We note that there is an alternative to block sorting, called \fIrow sorting\fP,
where pairs are sorted by (chrom1, pos1, chrom2, pos2).
In \fIpairtools sort\fP, we prefer block sorting since it cleanly separates cis
interactions from trans ones and thus is a better fit for typical
use cases.
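.sp
The difference between the two orders can be seen on a toy example. The sketch
below uses made\-up pairs and plain Python sorting; it only illustrates the two
sort keys and says nothing about how \fBpairtools sort\fP is implemented.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# Each pair is (chrom1, pos1, chrom2, pos2, pair_type); values are made up.
pairs = [
    ("chr1", 500, "chr2", 100, "UU"),
    ("chr1", 100, "chr1", 900, "UU"),
    ("chr1", 100, "chr2", 300, "UU"),
    ("chr1", 700, "chr1", 800, "UU"),
]


def block_key(pair):
    """Block sorting: chromosomes first, then positions, then pair type."""
    chrom1, pos1, chrom2, pos2, pair_type = pair
    return (chrom1, chrom2, pos1, pos2, pair_type)


def row_key(pair):
    """Row sorting: side 1 first, then side 2."""
    chrom1, pos1, chrom2, pos2, pair_type = pair
    return (chrom1, pos1, chrom2, pos2, pair_type)


block_sorted = sorted(pairs, key=block_key)
row_sorted = sorted(pairs, key=row_key)

# Block sorting keeps all chr1-chr1 pairs together, ahead of chr1-chr2 pairs.
print([p[:4] for p in block_sorted])
# Row sorting interleaves cis and trans pairs that start on chr1.
print([p[:4] for p in row_sorted])
```
.ft P
.fi
.UNINDENT
.UNINDENT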
.SS Flipping
.sp
In a typical paired\-end experiment, \fIside1\fP and \fIside2\fP of a DNA molecule are
defined by the order in which they got sequenced.
Since this order is essentially random, any given Hi\-C pair, e.g.
(chr1, 1.1Mb; chr2, 2.1Mb), may appear in a reversed orientation, i.e.
(chr2, 2.1Mb; chr1, 1.1Mb). If we were to preserve this order of sides, interactions
between the same loci would appear in two different locations of the sorted pair list,
which would complicate finding PCR/optical duplicates.
.sp
To ensure that Hi\-C pairs with similar coordinates end up in the same location of the sorted list,
we \fBflip\fP pairs, i.e. we choose \fIside1\fP as the side with the lower genomic coordinate.
Thus, after flipping, for \fItrans\fP pairs (chrom1!=chrom2), order(chrom1)<order(chrom2);
and for \fIcis\fP pairs (chrom1==chrom2), pos1<=pos2.
In a matrix representation, flipping is equal to reflecting the lower triangle
of the Hi\-C matrix onto its upper triangle, such that the resulting matrix
is upper\-triangular.
.sp
In \fIpairtools\fP, flipping is done during parsing \- that\(aqs why \fBpairtools parse\fP
requires a .chromsizes file that specifies the order of chromosomes for flipping.
Importantly, \fBpairtools parse\fP also flips one\-sided pairs such that
side1 is always unmapped; and unmapped pairs such that side1 always has a "poorer"
mapping type (i.e. null\-mapping<multi\-mapping).
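.sp
A minimal sketch of flipping, assuming each side is a (chrom, pos, strand)
tuple and the chromosome order is given as a list of names. The function name
is hypothetical, and the handling of unmapped sides is a simplification: any
chromosome missing from the list (such as "!") is ranked below all others, and
ties between mapping types are not resolved.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def flip(side1, side2, chrom_order):
    """Return the two sides ordered so that the lower coordinate comes first.

    Each side is a (chrom, pos, strand) tuple. chrom_order lists chromosome
    names in the order given by the .chromsizes file.
    """
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}

    def key(side):
        chrom, pos, _strand = side
        # Assumption: names absent from chrom_order get rank -1, so an
        # unmapped side always ends up as side1.
        return (rank.get(chrom, -1), pos)

    if key(side2) < key(side1):
        return side2, side1
    return side1, side2


order = ["chr1", "chr2"]
print(flip(("chr2", 2100000, "+"), ("chr1", 1100000, "-"), order))
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
In this example the chr1 side is returned first, matching the upper\-triangular
convention described above.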
.SS Chromosomal order for sorting and flipping
.sp
Importantly, the order of chromosomes for sorting and flipping can be different.
Specifically, \fBpairtools sort\fP uses the lexicographic order for chromosomes
(chr1, chr10, chr11, ..., chr2, chr21, ...) instead of the "natural" order
(chr1, chr2, chr3, ...); at the same time, flipping is done in
\fBpairtools parse\fP using the chromosomal order specified by the user.
.sp
The lexicographic order is the default way of sorting strings across
languages and tools [1],
which makes it easy to design tools for indexing and searching in sorted pair lists.
.sp
For flipping, \fBpairtools parse\fP relies on the chromosomal order provided by
the user. For performance reasons, we recommend
ordering chromosomes in the way that will be used in plotting contact maps.
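.sp
The two orders can be compared directly in Python. The "natural" key below is
a hypothetical helper written for this illustration; it only handles names made
of a text prefix followed by a number.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import re

chroms = ["chr2", "chr10", "chr1", "chr21", "chr11", "chr3"]

# Lexicographic order: plain string comparison, character by character.
lexicographic = sorted(chroms)


def natural_key(name):
    """Split a name into text and integer chunks, so that chr10 > chr2."""
    chunks = re.findall("[0-9]+|[^0-9]+", name)
    return [int(chunk) if chunk.isdigit() else chunk for chunk in chunks]


# "Natural" order: numeric chunks are compared as numbers.
natural = sorted(chroms, key=natural_key)

print(lexicographic)
print(natural)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The first list places chr10 and chr11 before chr2, while the second one follows
the numeric order of chromosomes.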
.IP [1] 5
Unfortunately, many existing genomes use rather unconventional choices
in chromosomal naming schemes. For example, in sacCer3, chromosomes are
enumerated with Roman numerals; in dm3, big autosomes are split into 4 different
contigs each. Thus, it is impossible to design a universal algorithm that
would order chromosomes in a "meaningful" way for all existing genomes.
.SH FORMATS FOR STORING HI-C PAIRS
.SS \&.pairs
.sp
\fI\&.pairs\fP is a simple tabular format for storing DNA contacts detected in
a Hi\-C experiment.  The detailed
\fI\%\&.pairs specification\fP
is defined by the 4DN Consortium.
.sp
The body of a .pairs file contains a table with a variable number of fields separated by
a "\et" character (a horizontal tab). The .pairs specification fixes the content
and the order of the first seven columns:
.TS
center;
|l|l|l|.
_
T{
index
T}	T{
name
T}	T{
description
T}
_
T{
1
T}	T{
read_id
T}	T{
the ID of the read as defined in fastq files
T}
_
T{
2
T}	T{
chrom1
T}	T{
the chromosome of the alignment on side 1
T}
_
T{
3
T}	T{
pos1
T}	T{
the 1\-based genomic position of the outer\-most (5\(aq) mapped bp on side 1
T}
_
T{
4
T}	T{
chrom2
T}	T{
the chromosome of the alignment on side 2
T}
_
T{
5
T}	T{
pos2
T}	T{
the 1\-based genomic position of the outer\-most (5\(aq) mapped bp on side 2
T}
_
T{
6
T}	T{
strand1
T}	T{
the strand of the alignment on side 1
T}
_
T{
7
T}	T{
strand2
T}	T{
the strand of the alignment on side 2
T}
_
.TE
.sp
A .pairs file starts with a header, an arbitrary number of lines starting
with a "#" character. By convention, the header lines have a format of
"#field_name: field_value".
The \fI\%\&.pairs specification\fP
mandates a few standard header lines (e.g., column names,
chromosome order, sorting order, etc), all of which are
automatically filled in by \fIpairtools\fP\&.
.sp
The entries of a .pairs file can be flipped and sorted. "Flipping" means
that \fIthe sides 1 and 2 do not correspond to side1 and side2 in sequencing data.\fP
Instead, side1 is defined as the side with the
alignment with a lower sorting index (using the lexicographic order for
chromosome names, followed by the numeric order for positions and the
lexicographic order for pair types). This particular order of "flipping" is
defined as "upper\-triangular flipping", or "triu\-flipping". Finally, pairs are
\fItypically\fP block\-sorted: i.e. first lexicographically by chrom1 and chrom2,
then numerically by pos1 and pos2.
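.sp
Because the format is plain text, a .pairs file can be read with a few lines of
Python. The sketch below is an illustration under stated assumptions: the
function name is hypothetical, the dictionary keys follow the column table
above, and the two header lines in the example are only placeholders for a real
header.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
TAB = chr(9)       # the horizontal tab that separates .pairs columns
NEWLINE = chr(10)  # the line terminator
COLUMNS = ["read_id", "chrom1", "pos1", "chrom2", "pos2", "strand1", "strand2"]


def parse_pairs(lines):
    """Split .pairs lines into header lines and a list of records."""
    header, records = [], []
    for line in lines:
        line = line.rstrip(NEWLINE)
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split(TAB)
        record = dict(zip(COLUMNS, fields[:7]))
        record["pos1"] = int(record["pos1"])
        record["pos2"] = int(record["pos2"])
        # Any columns beyond the first seven are kept as-is.
        record["extra"] = fields[7:]
        records.append(record)
    return header, records


example = [
    "## pairs format v1.0",
    "#columns: readID chr1 pos1 chr2 pos2 strand1 strand2",
    TAB.join(["read1", "chr1", "100", "chr2", "300", "+", "-"]),
]
header, records = parse_pairs(example)
print(records[0]["chrom2"], records[0]["pos2"])  # chr2 300
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
To read a compressed file, the same function could be given the file object
returned by gzip.open(path, "rt") from the Python standard library, since
bgzip output is compatible with gzip.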
.SS Pairtools\(aq flavor of .pairs
.sp
\&.pairs files produced by \fIpairtools\fP extend the .pairs format in a few ways.
.INDENT 0.0
.IP 1. 3
\fIpairtools\fP store null/ambiguous/chimeric alignments as chrom=\(aq!\(aq, pos=0, strand=\(aq\-\(aq.
.IP 2. 3
\fIpairtools\fP store the header of the source .sam files in the
\(aq#samheader:\(aq fields of the pairs header. When multiple .pairs files are merged,
the respective \(aq#samheader:\(aq fields are checked for consistency and merged.
.IP 3. 3
Each pairtool applied to .pairs leaves a record in the \(aq#samheader\(aq fields
(using a @PG sam tag), thus preserving the full history of data processing.
.IP 4. 3
\fIpairtools\fP append an extra column describing the type of a Hi\-C pair:
.UNINDENT
.TS
center;
|l|l|l|.
_
T{
index
T}	T{
name
T}	T{
description
T}
_
T{
8
T}	T{
pair_type
T}	T{
the type of a Hi\-C pair
T}
_
.TE
.SS Pair types
.sp
\fIpairtools\fP use a simple two\-character notation to define all possible pair
types, according to the quality of alignment of the two sides. The type of a pair
can be defined unambiguously using the table below. To use this table,
identify which side has an alignment of a "poorer" quality
(unmapped < multimapped < unique alignment)
and which side has a "better" alignment and find the corresponding row in the table.
.TS
center;
|l|l s|l s|l|l|l|
|l|l|l|l|l|l|l|l|.
_
T{
\&.
T}	T{
Less informative alignment
T}	T{
More informative alignment
T}	T{
\&.
T}	T{
\&.
T}	T{
\&.
T}
_
T{
>2 alignments
T}	T{
Mapped
T}	T{
Unique
T}	T{
Mapped
T}	T{
Unique
T}	T{
Pair type
T}	T{
Code
T}	T{
Sidedness
T}
_
T{
✔
T}	T{
❌
T}	T{
❌
T}	T{
❌
T}	T{
❌
T}	T{
walk\-walk
T}	T{
WW
T}	T{
0 [1]
T}
_
T{
❌
T}	T{
❌
T}	T{
T}	T{
❌
T}	T{
T}	T{
null
T}	T{
NN
T}	T{
0
T}
_
T{
❌
T}	T{
❌
T}	T{
T}	T{
❌
T}	T{
T}	T{
corrupt
T}	T{
XX
T}	T{
0 [2]
T}
_
T{
❌
T}	T{
❌
T}	T{
T}	T{
✔
T}	T{
❌
T}	T{
null\-multi
T}	T{
NM
T}	T{
0
T}
_
T{
✔
T}	T{
❌
T}	T{
T}	T{
✔
T}	T{
✔
T}	T{
null\-rescued
T}	T{
NR
T}	T{
1 [3]
T}
_
T{
❌
T}	T{
❌
T}	T{
T}	T{
✔
T}	T{
✔
T}	T{
null\-unique
T}	T{
NU
T}	T{
1
T}
_
T{
❌
T}	T{
✔
T}	T{
❌
T}	T{
✔
T}	T{
❌
T}	T{
multi\-multi
T}	T{
MM
T}	T{
0
T}
_
T{
✔
T}	T{
✔
T}	T{
❌
T}	T{
✔
T}	T{
✔
T}	T{
multi\-rescued
T}	T{
MR
T}	T{
1 [3]
T}
_
T{
❌
T}	T{
✔
T}	T{
❌
T}	T{
✔
T}	T{
✔
T}	T{
multi\-unique
T}	T{
MU
T}	T{
1
T}
_
T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
rescued\-unique
T}	T{
RU
T}	T{
2 [3]
T}
_
T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
unique\-rescued
T}	T{
UR
T}	T{
2 [3]
T}
_
T{
❌
T}	T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
unique\-unique
T}	T{
UU
T}	T{
2
T}
_
T{
❌
T}	T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
✔
T}	T{
duplicate
T}	T{
DD
T}	T{
2 [4]
T}
_
.TE
.IP [1] 5
"walks", or, \fI\%C\-walks\fP are
Hi\-C molecules formed via multiple ligation events which cannot be reported
as a single pair.
.IP [2] 5
"corrupt" pairs are those with technical issues \- e.g. missing a
FASTQ sequence/SAM entry from one side of the molecule.
.IP [3] 5
"rescued" pairs have two non\-overlapping alignments on one of the sides
(referred to below as the chimeric side/read), but the inner (3\(aq\-) one extends the
only alignment on the other side (referred to as the non\-chimeric side/read).
Such pairs form when one of the two ligated DNA fragments is shorter than
the read length. In this case, one of the reads contains this short fragment
entirely, together with the ligation junction and a chunk of the other DNA fragment
(thus, this read ends up having two non\-overlapping alignments).
Following the procedure introduced in \fI\%HiC\-Pro\fP
and \fI\%Juicer\fP, \fIpairtools parse\fP
rescues such Hi\-C molecules, reports the position of the 5\(aq alignment on the
chimeric side, and tags them as "NU", "MU", "UR" or "RU" pair type, depending
on the type of the 5\(aq alignment on the chimeric side. Such molecules can and
should be used in downstream analysis.
Read more on the rescue procedure in the section on parsing\&.
.IP [4] 5
\fIpairtools dedup\fP detects molecules that could be formed via PCR duplication and
tags them as "DD" pair type. These pairs should be excluded from downstream
analyses.
.SS \&.pairsam
.sp
\fIpairtools\fP also define .pairsam, a valid extension of the .pairs format.
On top of the pairtools\(aq flavor of .pairs, the .pairsam format adds two extra
columns containing the alignments from which the Hi\-C pair was extracted:
.TS
center;
|l|l|l|.
_
T{
index
T}	T{
name
T}	T{
description
T}
_
T{
9
T}	T{
sam1
T}	T{
the sam alignment(s) on side 1; separate supplemental alignments by NEXT_SAM
T}
_
T{
10
T}	T{
sam2
T}	T{
the sam alignment(s) on side 2; separate supplemental alignments by NEXT_SAM
T}
_
.TE
.sp
Note that, normally, the fields of a sam alignment are separated by a horizontal
tab character (\et), which we already use to separate .pairs columns. To
avoid confusion, we replace the tab character in sam entries stored in sam1 and
sam2 columns with a UNIT SEPARATOR character (\e031).
.sp
Finally, sam1 and sam2 can store multiple .sam alignments, separated by the string
\(aq\e031NEXT_SAM\e031\(aq.
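.sp
Unpacking a sam1 or sam2 cell therefore takes two splits: first into
alignments, then into SAM fields. In the sketch below, the function name is
hypothetical and both separators are passed in explicitly; the example uses a
visible stand\-in character instead of the real separator, which should be
taken from the actual files.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def unpack_sam_column(value, field_sep, entry_sep):
    """Split one sam1/sam2 cell into alignments, each a list of SAM fields."""
    return [entry.split(field_sep) for entry in value.split(entry_sep)]


# Stand-in separator used for this demonstration only.
SEP = "|"
ENTRY_SEP = SEP + "NEXT_SAM" + SEP

first = SEP.join(["read1", "65", "chr1"])
second = SEP.join(["read1", "2113", "chr2"])
cell = first + ENTRY_SEP + second

alignments = unpack_sam_column(cell, SEP, ENTRY_SEP)
print(len(alignments))   # 2
print(alignments[1][2])  # chr2
```
.ft P
.fi
.UNINDENT
.UNINDENT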
.SH TECHNICAL NOTES
.sp
Designing scientific software and formats requires making a multitude of
tantalizing technical decisions and compromises. Often, the reasons behind a
certain decision are non\-trivial and convoluted, involving many factors.
Here, we collect the notes and observations made during the design stage of
\fIpairtools\fP and provide a justification for most non\-trivial decisions.
We hope that this document will elucidate the design of \fIpairtools\fP and
may prove useful to developers in their future projects.
.SS \&.pairs format
.sp
The motivation behind some of the technical decisions in the pairtools\(aq flavor
of .pairs/.pairsam:
.INDENT 0.0
.IP \(bu 2
\fIpairtools\fP can store SAM entries together with the Hi\-C pair information in
\&.pairsam files. Storing pairs and alignments in the same row enables easy
tagging and filtering of paired\-end alignments based on their Hi\-C
information.
.IP \(bu 2
\fIpairtools\fP use the exclamation mark "!" instead of \(aq.\(aq as \(aqchrom\(aq of
unmapped reads because it has the lowest lexicographic sorting order among all
printable non\-whitespace ASCII characters. The use of \(aq0\(aq and \(aq\-\(aq in the \(aqpos\(aq and \(aqstrand\(aq fields of unmapped
reads allows us to keep the types of these fields as \(aqunsigned int\(aq and
enum{\(aq+\(aq,\(aq\-\(aq}, respectively.
.IP \(bu 2
"rescued" pairs have two types "UR" and "RU" instead of just "RU". We chose
this design because rescued pairs are two\-sided and thus are flipped based on
(chrom, pos), and not based on the side types. With two pair types "RU" and "UR",
\fIpairtools\fP can keep track of which side of the pair was rescued.
.IP \(bu 2
in "rescued" pairs, the type "R" is assigned to the non\-chimeric side.
This may seem counter\-intuitive at first, since it is the chimeric side that
gets rescued, but this way \fIpairtools\fP can keep track of the type of the
5\(aq alignment on the chimeric side (the alignment on the non\-chimeric side
has to be unique for the pair to be rescued).
.IP \(bu 2
\fIpairtools\fP rely on a text format, .pairs, instead of hdf5/parquet\-based
tables or custom binaries. We went with a text format for a few reasons:
.INDENT 2.0
.IP \(bu 2
text tables enable easy access to data from any language and any tool.
This is especially important at the level of Hi\-C pairs, the "rawest"
format of information from a Hi\-C experiment.
.IP \(bu 2
hdf5 and parquet have a few shortcomings that hinder their immediate use
in \fIpairtools\fP\&. Specifically, hdf5 cannot compress variable\-length strings
(which are, in turn, required to store sam alignments and some optional
information on pairs) and parquet cannot append columns to existing files,
modify datasets in place or store multiple tables in one file (which is
required to keep table indices in the same file with pairs).
.IP \(bu 2
text tables have a set of well\-developed and highly\-optimized tools for
sorting (Unix sort), compression (bgzip/lz4) and random access (tabix).
.IP \(bu 2
text formats enable easy streaming between individual command\-line tools.
.UNINDENT
.sp
Having said that, text formats have many downsides \- they are bulky when
not compressed, compression and parsing require extra computational
resources, they cannot be modified in place and random access requires extra
tools. In the future, we plan to develop a binary format based on existing
container formats, which would mitigate these downsides.
.UNINDENT
.SS CLI
.INDENT 0.0
.IP \(bu 2
many \fIpairtools\fP perform multiple actions at once, which contradicts the
"do one thing" philosophy of the Unix command line. We packed multiple (albeit
related) functions into one tool to improve the performance of \fIpairtools\fP\&.
Specifically, given the large size of Hi\-C data, a significant fraction of time
is spent on compression/decompression, parsing, loading data into memory and
sending it over the network (for cloud/clusters). Packing multiple functions
into one tool cuts down the number of such time\-consuming operations.
.IP \(bu 2
\fBpairtools parse\fP requires a .chromsizes file to know the order of chromosomes
and perform pair flipping.
.IP \(bu 2
\fIpairtools\fP use \fI\%bgzip\fP compression by
default instead of gzip. Using \fIbgzip\fP allows us to create an index with
\fI\%pairix\fP and get random access to data.
.IP \(bu 2
\fIpairtools\fP have an option to compress outputs with
\fI\%lz4\fP\&.
\fI\%Lz4 is much faster and only slightly less efficient than gzip\fP\&.
This makes lz4 a better choice for passing data between individual pairtools
before producing the final result (which, in turn, requires bgzip compression).
.UNINDENT
.SH COMMAND-LINE API
.INDENT 0.0
.IP \(bu 2
genindex
.UNINDENT
.SH AUTHOR
Mirny Lab
.SH COPYRIGHT
2017-2020, Mirny Lab
.\" Generated by docutils manpage writer.
.