.\" Man page generated from reStructuredText.
.
.TH "IGDISCOVER" "1" "Sep 06, 2019" "0.11" "IgDiscover"
.SH NAME
igdiscover \- IgDiscover Documentation
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.sp
IgDiscover analyzes antibody repertoires and discovers new V genes from high\-throughput sequencing reads.
Heavy chains, kappa and lambda light chains are supported (to discover VH, VK and VL genes).
.sp
IgDiscover is the result of a collaboration between the \fI\%Gunilla Karlsson Hedestam group\fP
at the \fI\%Department of Microbiology, Tumor and Cell Biology\fP at \fI\%Karolinska Institutet\fP,
Sweden and the \fI\%Bioinformatics Long\-Term Support\fP facility
at \fI\%Science for Life Laboratory (SciLifeLab)\fP, Sweden.
.sp
If you use IgDiscover, please cite:
.INDENT 0.0
.INDENT 3.5
.nf
Corcoran, Martin M. and Phad, Ganesh E. and Bernat, Néstor Vázquez and Stahl\-Hennig,
Christiane and Sumida, Noriyuki and Persson, Mats A.A. and Martin, Marcel and
Karlsson Hedestam, Gunilla B.
\fIProduction of individualized V gene databases reveals high levels of immunoglobulin genetic
diversity.\fP
Nature Communications 7:13642 (2016)
\fI\%https://dx.doi.org/10.1038/ncomms13642\fP
.fi
.sp
.UNINDENT
.UNINDENT
.SH LINKS
.INDENT 0.0
.IP \(bu 2
\fI\%Documentation\fP
.IP \(bu 2
\fI\%Source code\fP
.IP \(bu 2
\fI\%Report an issue\fP
.IP \(bu 2
\fI\%Project page on PyPI (Python package index)\fP
.UNINDENT
.SH INSTALLATION
.sp
IgDiscover is written in Python 3 and is developed on Linux. The tool also
runs on macOS, but is not as well tested on that platform.
.sp
For installation on either system, we recommend that you follow the instructions
below, which will first explain how to install the \fI\%Conda\fP
package manager. IgDiscover is available as a
Conda\-package from \fI\%the bioconda channel\fP\&.
Using Conda will make the installation easy because all dependencies are also
available as Conda packages and can thus be installed automatically along with
IgDiscover.
.sp
There are also non\-Conda installation instructions
if you cannot use Conda.
.SS Installing IgDiscover with Conda
.INDENT 0.0
.IP 1. 3
Install \fI\%Conda\fP by following the \fI\%conda installation
instructions\fP
as appropriate for your system. You will need to choose between a “Miniconda”
and “Anaconda” installation. We recommend Miniconda as the download is
smaller. If you are in a hurry, these two commands are usually sufficient to
install Miniconda on Linux (read the linked document for macOS instructions):
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
wget https://repo.continuum.io/miniconda/Miniconda3\-latest\-Linux\-x86_64.sh
bash Miniconda3\-latest\-Linux\-x86_64.sh
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
When the installer asks you about modifying the \fBPATH\fP in your \fB\&.bashrc\fP
file, answer \fByes\fP\&.
.IP 2. 3
Close the terminal window and open a new one. Then test whether conda is
installed correctly by running
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
conda \-\-version
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If you see the conda version number, it worked.
.IP 3. 3
Set up Conda so that it can access the
\fI\%bioconda channel\fP\&.
For that, follow \fI\%the instructions on the bioconda
website\fP
or simply run these commands:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
conda config \-\-add channels defaults
conda config \-\-add channels bioconda
conda config \-\-add channels conda\-forge
.ft P
.fi
.UNINDENT
.UNINDENT
.IP 4. 3
Install IgDiscover with this command:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
conda create \-n igdiscover igdiscover
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This will create a new so\-called “environment” for IgDiscover (retry if it fails). \fBWhenever you
want to run IgDiscover, you will need to activate the environment with this
command\fP:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
source activate igdiscover
.ft P
.fi
.UNINDENT
.UNINDENT
.IP 5. 3
Make sure you have activated the \fBigdiscover\fP environment.
Then test whether IgDiscover is correctly installed with this command:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover \-\-version
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If you see the version number of IgDiscover, it worked! If an error message appears that says
"The \(aqnetworkx\(aq distribution was not found and is required by snakemake", install networkx manually with:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
pip install networkx==2.1
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Then check the IgDiscover version again.
.IP 6. 3
You can now run IgDiscover on the test data set to familiarize
yourself with how it works.
.UNINDENT
.SS Troubleshooting on Linux
.sp
If you use \fBzsh\fP instead of \fBbash\fP (applies to Bio\-Linux, for example),
the \fB$PATH\fP environment variable will not be set up correctly by the
Conda installer. The Miniconda installer adds a line \fBexport PATH=...\fP to the
end of your \fB/home/your\-user\-name/.bashrc\fP file. Copy that line from
the file and add it to the end of the file \fB/home/your\-user\-name/.zshrc\fP
instead.
.sp
Alternatively, change your default shell to bash by running
\fBchsh \-s /bin/bash\fP\&.
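.sp
The first option above, copying the \fBexport PATH=...\fP line, can be sketched as a small
shell function. This is only an illustration: the function name \fBcopy_path_line\fP is
hypothetical, and the sketch assumes that the installer wrote a line that begins with
\fBexport PATH=\fP\&. Lines that are already present in the target file are not added again.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```shell
# Hypothetical helper: append the "export PATH=..." lines found in one
# shell startup file to another, skipping lines that are already there.
copy_path_line() {
    src=$1
    dst=$2
    grep '^export PATH=' "$src" | while IFS= read -r line; do
        # -x: match the whole line, -F: treat the pattern as a fixed string
        grep -qxF -e "$line" "$dst" 2> /dev/null || echo "$line" >> "$dst"
    done
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
A typical call would be \fBcopy_path_line ~/.bashrc ~/.zshrc\fP, followed by opening a new terminal.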
.sp
If you use conda and see an error that includes something like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
ImportError: .../.local/lib/python3.5/site\-packages/sqt/_helpers.cpython\-35m\-x86_64\-linux\-gnu.so: undefined symbol: PyFPE_jbuf
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
or if you see any error that mentions a \fB\&.local/\fP directory, then a previous
installation of IgDiscover is interfering with the conda installation.
.sp
The easiest way to solve this problem is to delete the directory \fB\&.local/\fP in
your home directory, see also how to remove IgDiscover from a Linux
system\&.
.SS Troubleshooting on macOS
.sp
If you get the error
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
ValueError: unknown locale: UTF\-8
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
then follow \fI\%these instructions\fP\&.
.SS Development version
.sp
To install IgDiscover directly from the most recent source code,
read the developer installation instructions\&.
.SH MANUAL INSTALLATION
.sp
IgDiscover requires quite a few other software tools that are not included in most Linux
distributions (or macOS) and which are also not available from the Python Package
Index (PyPI) because they are not Python tools. If you do not use the recommended simple
installation instructions via Conda, you need to install those non\-Python
dependencies manually. Regular Python dependencies are automatically pulled in when IgDiscover
itself is installed in the last step with the \fBpip install\fP command. The instructions below are
written for Linux and require modifications if you want to try this on macOS.
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
We recommend the much simpler installation via Conda
instead of using the instructions in this section.
.UNINDENT
.UNINDENT
.SS Install non\-Python dependencies
.sp
The dependencies are: MUSCLE, IgBLAST, PEAR, and \-\- optionally \-\- flash.
.INDENT 0.0
.IP 1. 3
Install Python 3.5 or newer. It most likely is already installed on your system, but
in Debian/Ubuntu, you can get it with
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
sudo apt\-get install python3
.ft P
.fi
.UNINDENT
.UNINDENT
.IP 2. 3
Create the directory where binaries will be installed. We assume
\fB$HOME/.local/bin\fP here, but this can be anywhere as long as it is in
your \fB$PATH\fP\&.
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
mkdir \-p ~/.local/bin
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Add this line to the end of your \fB~/.bashrc\fP file:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
export PATH=$HOME/.local/bin:$PATH
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Then either start a new shell or run \fBsource ~/.bashrc\fP to get the changes.
.IP 3. 3
Install MUSCLE. This is available as a package in Ubuntu:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
sudo apt\-get install muscle
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If your distribution does not have a \(aqmuscle\(aq package or if you are not allowed
to run \fBsudo\fP:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
wget \-O \- http://www.drive5.com/muscle/downloads3.8.31/muscle3.8.31_i86linux64.tar.gz | tar xz
mv muscle3.8.31_i86linux64 ~/.local/bin/muscle
.ft P
.fi
.UNINDENT
.UNINDENT
.IP 4. 3
Install PEAR:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
wget http://sco.h\-its.org/exelixis/web/software/pear/files/pear\-0.9.6\-bin\-64.tar.gz
tar xvf pear\-0.9.6\-bin\-64.tar.gz
mv pear\-0.9.6\-bin\-64/pear\-0.9.6\-bin\-64 ~/.local/bin/pear
.ft P
.fi
.UNINDENT
.UNINDENT
.IP 5. 3
Install IgBLAST:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/1.4.0/ncbi\-igblast\-1.4.0\-x64\-linux.tar.gz
tar xvf ncbi\-igblast\-1.4.0\-x64\-linux.tar.gz
mv ncbi\-igblast\-1.4.0/bin/igblast? ~/.local/bin/
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
IgBLAST requires some data files that must be downloaded separately. The
following commands put the files into \fB~/.local/igdata\fP:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
mkdir ~/.local/igdata
cd ~/.local/igdata
wget \-r \-nH \-\-cut\-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/internal_data
wget \-r \-nH \-\-cut\-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/database/
wget \-r \-nH \-\-cut\-dirs=4 ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/optional_file/
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Also, you must set the \fB$IGDATA\fP environment variable to point to the
directory with data files. Add this line to your \fB~/.bashrc\fP:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
export IGDATA=$HOME/.local/igdata
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Then run \fBsource ~/.bashrc\fP to get the changes.
.UNINDENT
.INDENT 0.0
.IP 6. 3
Optionally, install flash:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
wget \-O FLASH\-1.2.11.tar.gz http://sourceforge.net/projects/flashpage/files/FLASH\-1.2.11.tar.gz/download
tar xf FLASH\-1.2.11.tar.gz
cd FLASH\-1.2.11
make
mv flash ~/.local/bin/
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
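.sp
Once the steps above are done, it can be useful to check that the installed programs are
actually found on your \fB$PATH\fP\&. The following is a minimal sketch; the function name
\fBmissing_tools\fP is hypothetical, and the program names in the usage example are
assumptions based on the file names used in the commands above. Adjust them to match
what you installed.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```shell
# Print the name of every given program that cannot be found on $PATH.
# Returns a non-zero status if at least one program is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return $status
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
For example, \fBmissing_tools muscle pear igblastn flash\fP prints nothing if all four programs are found.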
.SS Install IgDiscover
.sp
Install IgDiscover with the Python package manager \fBpip\fP, which will download and install IgDiscover and its
dependencies:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
pip3 install \-\-user igdiscover
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The command also installs all remaining Python dependencies. The \fB\-\-user\fP option
instructs \fBpip\fP to install everything into \fB$HOME/.local\fP\&.
.sp
Finally, check the installation with
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover \-\-version
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
and you should see the version number of IgDiscover.
.sp
You should now run IgDiscover on the test data set\&.
.SH TEST DATA SET
.sp
After installing IgDiscover, you should run it once on a small test data set that we
provide, both to test your installation and to familiarize yourself with
running the program.
.INDENT 0.0
.IP 1. 3
Download and unpack \fI\%the test data set (version 0.5)\fP\&. To do this
from the command\-line, use these commands:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
wget https://bitbucket.org/igdiscover/testdata/downloads/igdiscover\-testdata\-0.5.tar.gz
tar xvf igdiscover\-testdata\-0.5.tar.gz
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.INDENT 0.0
.INDENT 3.5
The test data set contains some paired\-end reads from human IgM heavy chain
dataset ERR1760498 and a database of IGHV, IGHD, IGHJ sequences based on
Ensembl annotations. You should use a database of higher quality for your
own experiments.
.UNINDENT
.UNINDENT
.INDENT 0.0
.IP 2. 3
Initialize the IgDiscover pipeline directory:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover init \-\-db igdiscover\-testdata/database/ \-\-reads1 igdiscover\-testdata/reads.1.fastq.gz discovertest
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The name \fBdiscovertest\fP is the name of the pipeline directory that will be
created. Note that only the path to the \fIfirst\fP reads file needs to be
given. The second file is found automatically. There may be a couple of
messages “Skipping \(aqx\(aq because it contains the same sequence as \(aqy\(aq”, which
you can ignore.
.sp
The command will have printed a message telling you that the pipeline
directory has been initialized, that you should edit the configuration file,
and how to actually run IgDiscover after that.
.IP 3. 3
The generated \fBigdiscover.yaml\fP configuration file does not actually need
to be edited for the test dataset, but you may still want to have a read
through it as you will need to do so for your own data. You may want to do
this while the pipeline is running in the next step. The configuration is in
YAML format. When editing the file, just follow the way it is already
structured.
.IP 4. 3
Run the analysis. To do so, change into the pipeline directory and run this
command:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
cd discovertest && igdiscover run
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
On this small dataset, running the pipeline should take no more than about 5 minutes.
.IP 5. 3
Finally, inspect the results in the \fBdiscovertest/iteration\-01\fP or
\fBdiscovertest/final\fP directories. The discovered V genes and extra
information are listed in
\fBdiscovertest/iteration\-01/new_V_germline.tab\fP\&. Discovered J genes are
in \fBdiscovertest/iteration\-01/new_J.tab\fP\&. There are also corresponding
\fB\&.fasta\fP files with the sequences only.
.sp
See the explanation of final result files\&.
.UNINDENT
.SS Other test data sets
.sp
ENA project \fI\%PRJEB15295\fP contains the data for
our Nature Communications paper from 2016, in particular
\fI\%ERR1760498\fP, which is the data for the human “H1”
sample (multiplex PCR, IgM heavy chain).
.sp
Data used for testing TCR detection (human, RACE): \fI\%SRR2905677\fP and
\fI\%SRR2905710\fP\&.
.SH USER GUIDE
.SS Overview
.sp
IgDiscover works on a single library at a time. It works within an
“analysis directory” for the library, which contains all intermediate
and result files.
.sp
To start an analysis, you need:
.INDENT 0.0
.IP 1. 3
A FASTA or FASTQ file with single\-end reads or two FASTQ files with
paired\-end reads (the files must be gzip\-compressed)
.IP 2. 3
A database of V/D/J genes (three FASTA files named \fBV.fasta\fP, \fBD.fasta\fP, \fBJ.fasta\fP)
.IP 3. 3
A configuration file that describes the library
.UNINDENT
.sp
If you do not have a V/D/J database yet, you may want to read the section about
\fI\%how to obtain V/D/J sequences\fP\&.
.sp
To run an analysis, proceed as follows.
.sp
\fBNOTE:\fP
.INDENT 0.0
.INDENT 3.5
If you are on macOS, it may be necessary to run \fBexport SHELL=/bin/bash\fP before continuing.
.UNINDENT
.UNINDENT
.INDENT 0.0
.IP 1. 3
Create and initialize the analysis directory.
.sp
First, pick a name for your analysis. We will use \fBmyexperiment\fP in the following.
Then run \fBigdiscover init\fP:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover init myexperiment
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
A dialog will appear and ask for the file with the \fIfirst\fP (forward) reads.
Find your compressed FASTQ file that contains them and select it.
Typical file names may be \fBLibrary1_S1_L001_R1_001.fastq.gz\fP or \fBmylibrary.1.fastq.gz\fP\&.
You do not need to choose the second read file!
It is found automatically.
.sp
Next, choose the directory with your database.
The directory must contain the three files \fBV.fasta\fP, \fBD.fasta\fP, \fBJ.fasta\fP\&.
These files contain the V, D, J gene sequences, respectively.
Even if you have only light chains in your data, a \fBD.fasta\fP file needs to be provided;
just use one with the heavy chain D gene sequences.
.sp
If you do not want a graphical user interface, use the two command\-line
parameters \fB\-\-db\fP and \fB\-\-reads1\fP to provide this information instead:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover init \-\-db path/to/my/database/ \-\-reads1 mylibrary.1.fastq.gz myexperiment
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Again, the second reads file will be found automatically.
Use \fB\-\-single\-reads\fP instead of \fB\-\-reads1\fP if you have single\-end reads or a dataset with already merged reads.
For \fB\-\-single\-reads\fP, a FASTA file (not only FASTQ) is also allowed.
In any case, an analysis directory named \fBmyexperiment\fP will have been created.
.IP 2. 3
Adjust the configuration file
.sp
The previous step created a configuration file named \fBmyexperiment/igdiscover.yaml\fP, which
you may \fI\%need to adjust\fP\&. In particular, the number of discovery rounds
is set to 3 by default, which takes a long time. Reducing this to 2 or even 1 often works just
as well.
.IP 3. 3
Run the analysis
.sp
Change into the newly created analysis directory and run the analysis:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover run
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Depending on the size of your library, your computer, and the number of iterations, this will
now take from a few hours to a day. See the \fI\%running IgDiscover\fP section for
more fine\-grained control over what to run and how to resume the process if something failed.
.UNINDENT
.SS Obtaining a V/D/J database
.sp
We use the term “database” to refer to three FASTA files that contain the sequences for the V, D
and J genes.
IMGT provides \fI\%sequences for download\fP\&.
For discovering new VH genes, for example, you need to get the IGHV, IGHD and IGHJ files of your species.
As IgDiscover uses this only as a starting point, using a similar species will also work.
.sp
When using an IMGT database, it is very important to change the long IMGT sequence headers to
short headers as IgBLAST does not accept the long headers. We recommend using the program
\fBedit_imgt_file.pl\fP\&. If you installed IgDiscover from Conda, the script is already installed and
you can run it by typing the name. It is also
\fI\%available on the IgBlast FTP site\fP\&.
.sp
Run it for all three downloaded files, and then rename the files appropriately so that they
are named \fBV.fasta\fP, \fBD.fasta\fP and \fBJ.fasta\fP\&.
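.sp
To illustrate what a “short header” means, the following sketch keeps only one field of
each FASTA header line. It is not a replacement for \fBedit_imgt_file.pl\fP and may not
do everything that script does. It assumes that the long headers consist of fields
separated by \fB|\fP with the gene and allele name in the second field; the function name
\fBshorten_headers\fP is hypothetical.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```shell
# Illustration only: keep the second "|"-separated field of each FASTA
# header line and pass all other lines through unchanged.
shorten_headers() {
    awk -F '|' '/^>/ && NF > 1 { print ">" $2; next } { print }' "$1"
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Header lines without a \fB|\fP character are left as they are.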
.sp
You always need a file with D genes even if you analyze light chains.
.sp
In case you have used IgBLAST previously, note that there is \fIno need\fP to run the \fBmakeblastdb\fP
tool as IgDiscover will do that for you.
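.sp
Before running \fBigdiscover init\fP, you may want to confirm that the database directory
contains the three expected files. A minimal sketch, with the hypothetical function name
\fBcheck_database_dir\fP:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```shell
# Return success only if the directory contains V.fasta, D.fasta and
# J.fasta; otherwise print the name of each missing file.
check_database_dir() {
    dir=$1
    status=0
    for name in V.fasta D.fasta J.fasta; do
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $dir/$name"
            status=1
        fi
    done
    return $status
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The check only looks at file names; it does not validate the sequences or their headers.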
.SS Input data requirements
.SS Paired\-end or single\-end data
.sp
IgDiscover can process input data of three different types:
.INDENT 0.0
.IP \(bu 2
Paired\-end reads in gzipped FASTQ format,
.IP \(bu 2
Single\-end reads in gzipped FASTQ format,
.IP \(bu 2
Single\-end reads in gzipped FASTA format.
.UNINDENT
.sp
IgDiscover was tested mainly on paired\-end Illumina MiSeq reads (2x300bp), but it can also handle
454 and Ion Torrent data.
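.sp
Since all three input types must be gzip\-compressed, a quick integrity check of the
input files can save a failed run. The sketch below relies only on the standard
\fBgzip \-t\fP test mode; the function name \fBis_gzipped\fP is hypothetical.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```shell
# Succeed only if the file is a valid gzip archive.
is_gzipped() {
    gzip -t "$1" 2> /dev/null
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
For example, \fBis_gzipped mylibrary.1.fastq.gz || echo "not gzip\-compressed"\fP reports a file that still needs to be compressed.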
.sp
Depending on the input file type, use a variant of one of the following commands to initialize
the analysis directory:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover init \-\-single\-reads=file.fasta.gz  \-\-database=my\-database\-dir/ myexperiment
igdiscover init \-\-single\-reads=file.fastq.gz  \-\-database=my\-database\-dir/ myexperiment
igdiscover init \-\-reads1=file.1.fastq.gz  \-\-database=my\-database\-dir/ myexperiment
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Read layout
.sp
Paired\-end reads are first merged and then processed in the same way as single\-end reads. Reads
that could not be merged are discarded. Single\-end reads and merged paired\-end reads are expected
to follow this structure (from 5\(aq to 3\(aq):
.INDENT 0.0
.IP \(bu 2
The forward primer sequence. This is optional.
.IP \(bu 2
A random barcode (molecular identifier). This is optional. Set the
configuration option \fBbarcode_length_5p\fP to 0 if you don’t have random barcodes
or if you don’t want the program to use them.
.IP \(bu 2
Optionally, a run of G nucleotides. This is an artifact of the RACE protocol (Rapid
amplification of cDNA ends). If you have this, set \fBrace_g\fP to \fBtrue\fP in the configuration file.
.IP \(bu 2
5\(aq UTR
.IP \(bu 2
Leader
.IP \(bu 2
Re\-arranged V, D and J gene sequences for heavy chains; only V and J for light chains
.IP \(bu 2
An optional random barcode. Set the configuration option \fBbarcode_length_3p\fP to the length of
this barcode. You cannot currently have both a 5\(aq and a 3\(aq barcode.
.IP \(bu 2
The reverse primer. This is optional.
.UNINDENT
.sp
We use IgBLAST to detect the location of the V, D, J genes through the
\fBigdiscover igblast\fP subcommand. The G nucleotides
after the barcode are split off if the configuration specifies
\fBrace_g: true\fP\&. The leader sequence is detected by looking for a start
codon near 60 bp upstream of the start of the V gene match.
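.sp
To make the start of this layout concrete, the following sketch removes a 5\(aq barcode of
fixed length and the run of G nucleotides that follows it from a single sequence. It is
a simplified illustration and not IgDiscover\(aqs implementation: it assumes that there is
no forward primer in front of the barcode, and the function name
\fBstrip_barcode_and_g\fP is hypothetical.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```shell
# Illustration only: drop the first N characters (the barcode) of a
# sequence and then any G nucleotides at the start of what remains.
strip_barcode_and_g() {
    sequence=$1
    barcode_length=$2
    echo "$sequence" | cut -c "$((barcode_length + 1))-" | sed 's/^G*//'
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
With a barcode length of 0, only the leading G nucleotides are removed.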
.SS Configuration
.sp
The \fBigdiscover init\fP command creates a configuration file
\fBigdiscover.yaml\fP in the analysis directory. To configure
your analysis, change that file with a text editor before
running the analysis with \fBigdiscover run\fP\&.
.sp
The syntax should be mostly self\-explanatory.
The file is in YAML format, but you will not need to learn that.
Just follow the examples given in the file.
A few rules that may be good to know are the following ones:
.INDENT 0.0
.IP 1. 3
Lines starting with the \fB#\fP symbol are comments (they are ignored)
.IP 2. 3
A configuration option that is meant to be switched on or off will say something like \fBstranded: false\fP if it is off.
Change this to \fBstranded: true\fP to switch the option on (and vice versa).
.IP 3. 3
The primer sequences are given as a list, and must be written in a certain way \- one sequence per line, and a \fB\-\fP (dash) in front, like so:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
forward_primers:
\- ACGTACGTACGT
\- AACCGGTTAACC
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Even if you have only one primer sequence, you still need to use this syntax.
.UNINDENT
.sp
To find out what the configuration options achieve, see the explanations in the configuration file itself.
.sp
The main parameters that may require adjusting are the following.
.sp
The \fBiterations\fP option sets the number of rounds of V gene discovery
that will be performed. By default, three iterations are run. Even with a very restricted
starting V database (for example with only a single V gene sequence),
this is usually sufficient to identify most novel germline sequences.
.sp
When the starting database is more complete, for example, when analyzing
a human IgM library with the current IMGT heavy chain database, a single
iteration may be sufficient to produce an individualized database.
.sp
If you do not want to discover any new genes and only want to produce an
expression profile, for example, then use \fBiterations: 0\fP\&.
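.sp
If you prefer to change the number of iterations from the command line rather than in a
text editor, a sketch along the following lines may help. It assumes that the option
appears in \fBigdiscover.yaml\fP on a line of its own that starts with \fBiterations:\fP;
the function name \fBset_iterations\fP is hypothetical. Any comment on the same line is
discarded.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```shell
# Rewrite the "iterations:" line of a configuration file.
# A temporary file is used instead of "sed -i", whose syntax differs
# between GNU and BSD (macOS) sed.
set_iterations() {
    config=$1
    count=$2
    sed "s/^iterations:.*/iterations: $count/" "$config" > "$config.tmp" &&
        mv "$config.tmp" "$config"
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
For example, \fBset_iterations myexperiment/igdiscover.yaml 1\fP requests a single round of discovery.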
.sp
The \fBignore_j\fP option should be set to \fBtrue\fP when producing a V gene
database for a species where J sequences are unknown:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
ignore_j: true
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Setting the parameters \fBstranded\fP, \fBforward_primers\fP and \fBreverse_primers\fP
to the correct values can be used to remove 5\(aq and 3\(aq primers from the sequences.
Doing this is not strictly necessary for IgDiscover. It is simplest
if you do not specify any primer sequences.
.SS Pregermline and germline filter criteria
.sp
These settings provide IgDiscover with stringency requirements for V gene discovery
that enable the program to filter out false positives. Usually the “pregermline
filter” can be left at its default settings since all these sequences are
subsequently passed to the higher\-stringency “germline filter”, where the
criteria are set to maximize stringency. Here is how it looks in the configuration
file:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
pre_germline_filter:
  unique_cdr3s: 2      # Minimum number of unique CDR3s (within exact matches)
  unique_js: 2         # Minimum number of unique J genes (within exact matches)
  check_motifs: false  # Check whether 5\(aq end starts with known motif
  whitelist: true      # Add database sequences to the whitelist
  cluster_size: 0      # Minimum number of sequences assigned to cluster
  differences: 0       # Merge sequences if they have at most this number of differences
  allow_stop: true     # Whether to allow non\-productive sequences containing stop codons
  cross_mapping_ratio: 0.02  # Threshold for removal of cross\-mapping artifacts (set to 0 to disable)
  allele_ratio: 0.1    # Required minimum ratio between alleles of a single gene

# Filtering criteria applied to candidate sequences in the last iteration.
# These should be more strict than the pre_germline_filter criteria.
#
germline_filter:
  unique_cdr3s: 5      # Minimum number of unique CDR3s (within exact matches)
  unique_js: 3         # Minimum number of unique J genes (within exact matches)
  check_motifs: false  # Check whether 5\(aq end starts with known motif
  whitelist: true      # Add database sequences to the whitelist
  cluster_size: 100    # Minimum number of sequences assigned to cluster
  differences: 0       # Merge sequences if they have at most this number of differences
  allow_stop: false    # Whether to allow non\-productive sequences containing stop codons
  cross_mapping_ratio: 0.02  # Threshold for removal of cross\-mapping artifacts (set to 0 to disable)
  allele_ratio: 0.1    # Required minimum ratio between alleles of a single gene
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Factors that affect germline discovery include library source (IgM vs IgK, IgL or IgG),
library size, sequence error rate and individual genomic factors (for example the
number of J segments present in an individual).
.sp
In general, setting a higher cutoff of \fBunique_cdr3s\fP and \fBunique_js\fP will minimize the number
of false positives in the output. Example:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
unique_cdr3s: 10      # Minimum number of unique CDR3s (within exact matches)
unique_js: 4          # Minimum number of unique J genes (within exact matches)
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If the \fBdifferences\fP parameter is set to a value higher than 0, the germline filter inspects
clusters of sequences that are closely related (when the edit distance between them is at
most \fBdifferences\fP) and retains only the most common sequence of each cluster. Previously, we
believed this would remove some false positives due to accumulated random sequence errors of highly
expressed alleles that otherwise would pass the cutoff criteria. However, we found out that we miss
true positives, in particular if there are two alleles in the sample that differ in only a single
nucleotide. We have now implemented other measures to avoid false positives and recommend against
setting the \fBdifferences\fP to something other than \fB0\fP\&.
.sp
Read also about the \fI\%cross mapping\fP, for which germline filtering corrects, and
about the \fI\%germline filters\fP\&.
.sp
Changed: The default for the \fBdifferences\fP setting was changed from 1 to 0.

.SS Running IgDiscover
.SS Resuming failed runs
.sp
The command \fBigdiscover run\fP, which is used to start the pipeline, can also be used to resume
execution if there was an interruption (a transient failure). Reasons for interruptions might be:
.INDENT 0.0
.IP \(bu 2
Ctrl+C was pressed on the keyboard
.IP \(bu 2
A full hard disk
.IP \(bu 2
If running on a cluster, the program may have been terminated because it exceeded its allocated
time
.IP \(bu 2
Too little RAM
.IP \(bu 2
Power loss
.UNINDENT
.sp
To resume execution after you have fixed the problem, go to the analysis directory and run
\fBigdiscover run\fP again. It will skip the steps that have already finished successfully.
This capability comes from the workflow management system
\fI\%snakemake\fP, on which \fBigdiscover run\fP is based.
Snakemake will determine automatically which steps need to be re\-run in order to get to a full
result and then run only those.
.sp
Alterations to the configuration file after an interruption are possible, but affect only
steps that have not already finished successfully. For example, assume you interrupted a
run with Ctrl+C after it is already past the step in which barcodes are removed. Then,
even if you change the barcode length in the configuration, the barcode removal step will
not be re\-run when you resume the pipeline and the previous barcode length is in effect.
See also the next section.
.SS Changing parameters and re\-running parts of the pipeline
.sp
When you experiment with parameters in the \fBigdiscover.yaml\fP file, such as
germline filtering criteria, you do not need to re\-run the entire pipeline from
the beginning, but can re\-use the results that already exist. This can save a lot
of processing time, in particular when you avoid re\-running IgBLAST in this way.
.sp
As described in the previous section, \fBigdiscover run\fP automatically figures out
which files need to be re\-created if a run was interrupted. Unfortunately, this
mechanism is currently not smart enough to also look for changes in the
\fBigdiscover.yaml\fP file. Thus, if the full pipeline has finished successfully,
then re\-running \fBigdiscover run\fP will just print the message \fBNothing to be done.\fP
even after you have changed the configuration file.
.sp
You will therefore need to know yourself which file you want to regenerate.
Then follow these steps. Note that these will remove parts of the existing
results, and if you need to keep them, make a copy of your analysis directory first.
.INDENT 0.0
.IP 1. 3
Change the configuration setting.
.IP 2. 3
Delete the file that needs to be re\-generated. Assume it is \fBfilename\fP\&.
.IP 3. 3
Run \fBigdiscover run filename\fP to re\-create the file. Only that file
will be created, not the ones that usually would be created afterwards.
.IP 4. 3
Optionally, run \fBigdiscover run\fP (without a file name this time) to
update the remaining files (those that depend on the file that was just
updated).
.UNINDENT
.sp
For example, assume you want to modify some germline filtering setting and then re\-run
the pipeline. Change the setting in your \fBigdiscover.yaml\fP, then run these
commands:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
rm iteration\-01/new_V_germline.tab
igdiscover run iteration\-01/new_V_germline.tab
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The above will have regenerated the \fBiteration\-01/new_V_germline.tab\fP file
and also the \fBiteration\-01/new_V_germline.fasta\fP file since they are
generated by the same script. If you want to update any other files, then also
run
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover run
.ft P
.fi
.UNINDENT
.UNINDENT
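.sp
Because the steps above delete existing result files, it is worth copying the analysis
directory before experimenting. The following sketch makes a timestamped copy; the
function name \fBbackup_analysis_dir\fP and the naming scheme of the copy are
hypothetical. Symbolic links, such as the links to the raw reads, are copied as links.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```shell
# Copy an analysis directory to "<name>.backup-<timestamp>" next to
# it and print the name of the copy.
backup_analysis_dir() {
    dir=${1%/}
    copy="$dir.backup-$(date +%Y%m%d-%H%M%S)"
    # -R: copy recursively, -P: copy symbolic links as links
    cp -RP "$dir" "$copy" && echo "$copy"
}
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Run it from the parent directory, for example \fBbackup_analysis_dir myexperiment\fP, before deleting any result file.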
.SS The analysis directory
.sp
IgDiscover writes all intermediate files, the final V gene database, statistics and plots into
the analysis directory that was created with \fBigdiscover init\fP\&.
Inside that directory, there is a \fBfinal/\fP subdirectory that contains the analysis results.
.sp
These are the files and subdirectories that can be found in the analysis directory.
Subdirectories are described in detail below.
.INDENT 0.0
.TP
.B igdiscover.yaml
The configuration file.
Make sure to adjust this to your needs as described above.
.TP
.B reads.1.fastq.gz, reads.2.fastq.gz
Symbolic links to the raw paired\-end reads.
.TP
.B database/
The input V/D/J database (as three FASTA files).
The files are a copy of the ones you selected when running \fBigdiscover init\fP\&.
.TP
.B reads/
Processed reads (merged, de\-duplicated etc.)
.TP
.B iteration\-xx/
Iteration\-specific analysis directory, where “xx” is a number starting from 01.
Each iteration is run in one of these directories.
The first iteration (in \fBiteration\-01\fP) uses the original input database, which is also found in the \fBdatabase/\fP directory.
The database is updated and then used as input for the next iteration.
.TP
.B final/
After the last iteration, IgBLAST is run again on the input sequences, but using the final database (the one created in the very last iteration).
This directory contains all the results, such as plots of the repertoire profiles.
If you set the number of iterations to 0 in the configuration file, this directory is the only one that is created.
.UNINDENT
.SS Final results
.sp
Final results are found in the \fBfinal/\fP subdirectory of the analysis directory.
.INDENT 0.0
.TP
.B final/database/(V,D,J).fasta
These three files represent the final, individualized V/D/J database found by IgDiscover.
The D and J files are copies of the original starting database;
they are not updated by IgDiscover.
.TP
.B final/dendrogram_(V,D,J).pdf
These three PDF files contain dendrograms of the V, D and J sequences in the individualized
database.
.TP
.B final/assigned.tab.gz
V/D/J gene assignments and other information for each sequence.
The file is created by parsing the IgBLAST output in the \fBigblast.txt.gz\fP file.
This is a table that contains one row for each input sequence.
See below for a detailed description of the columns.
.TP
.B final/filtered.tab.gz
Filtered V/D/J gene assignments. This is the same as the \fBassigned.tab.gz\fP file mentioned above, but with low\-quality assignments filtered out.
Run \fBigdiscover filter \-\-help\fP to see the filtering criteria.
.TP
.B final/expressed_(V,D,J).tab, final/expressed_(V,D,J).pdf
The V, D and J gene expression counts. Some assignments are filtered out to reduce artifacts. In particular,
an allele\-ratio filter of 10% is applied. For D genes, only those with an E\-value of at most
1E\-4 and a coverage of at least 70% are counted. See also the help for the \fBigdiscover count\fP
subcommand, which is used to create these files.
.sp
The \fB\&.tab\fP file contains the counts as a table, while the PDF file contains a plot of the same values.
.sp
These tables also exist in the iteration\-specific directories (\fBiteration\-xx\fP). For those,
note that the numbers do not include the genes that were discovered in that iteration. For
example, \fBiteration\-01/expressed_V.tab\fP shows only expression counts of the V genes in the
starting database.
.TP
.B final/errorhistograms.pdf
A PDF with one page per V gene/allele.
Each page shows a histogram of the percentage differences for that gene.
.TP
.B final/clusterplots/
This is a directory that contains one PNG file for each discovered gene/allele.
Each image shows a clusterplot of all the sequences assigned to that gene.
Note that the clusterplots show at most 300 sequences by default,
while the actual clustering used by IgDiscover uses 1000 sequences.
.UNINDENT
.sp
If you are interested in the results of each iteration, you can inspect the iteration\-xx/ directories.
They are structured in the same way as the final/ subdirectory, except that the results are based on the intermediate databases of that iteration.
They also contain the following additional files.
.INDENT 0.0
.TP
.B iteration\-xx/candidates.tab
A table with candidate novel V alleles (or genes).
This is a list of sequences found through the \fIwindowing strategy\fP or \fIlinkage cluster analysis\fP, as discussed in our paper. See \fI\%the full description of candidates.tab\fP\&.
.TP
.B iteration\-xx/read_names_map.tab
For each candidate novel V allele listed in \fBcandidates.tab\fP, this file contains one row that
lists which sequences went into generating this candidate. Only the exact matches are listed,
that is, the number of listed sequence names should be equal to the value in the \fIexact\fP
column. Each line in this file contains tab\-separated values. The first is the name of the
candidate, the others are the names of the sequences. Some of these sequences may be consensus
sequences if barcode grouping was enabled, so in that case, this will not be a read name.
.TP
.B iteration\-xx/new_V_germline.fasta, iteration\-xx/new_V_pregermline.fasta
The discovered list of V genes for this iteration.
The file is created from the \fBcandidates.tab\fP file by applying either the germline or pre\-germline filter.
The file resulting from application of the germline filter is used in the last iteration only.
The file resulting from application of the pre\-germline filter is used in earlier iterations.
.TP
.B iteration\-xx/annotated_V_germline.tab, iteration\-xx/annotated_V_pregermline.tab
A version of the \fBcandidates.tab\fP file that is annotated with extra columns that describe why a candidate was filtered out. See \fI\%the description of this file\fP\&.
.UNINDENT
.SS Other files
.sp
For completeness, here is a description of the files in the \fBreads/\fP and \fBstats/\fP directories.
They are created during pre\-processing and are not iteration specific.
.INDENT 0.0
.TP
.B reads/1\-limited.1.fastq.gz, reads/1\-limited.2.fastq.gz
Input reads file limited to the first N entries. This is just a symbolic
link to the input file if the \fBlimit\fP configuration option is not set.
.TP
.B reads/2\-merged.fastq.gz
Reads merged with PEAR or FLASH
.TP
.B reads/3\-forward\-primer\-trimmed.fastq.gz
Merged reads with 5\(aq primer sequences removed. (This file is automatically removed when
it is not needed anymore.)
.TP
.B reads/4\-trimmed.fastq.gz
Merged reads with 5\(aq and 3\(aq primer sequences removed.
.TP
.B reads/5\-filtered.fasta
Merged, primer\-trimmed sequences converted to FASTA, and too short sequences removed.
(This file is automatically removed when it is not needed anymore.)
.TP
.B reads/sequences.fasta.gz
Fully pre\-processed sequences. That is, filtered sequences without duplicates (using VSEARCH)
.TP
.B stats/reads.txt
Statistics of pre\-processed sequences.
.TP
.B stats/readlengths.txt, stats/readlengths.pdf
Histogram of the lengths of pre\-processed sequences (created from \fBreads/sequences.fasta.gz\fP)
.UNINDENT
.SS Format of output files
.SS assigned.tab.gz
.sp
This file is a gzip\-compressed table with tab\-separated values. It is created by
the \fBigdiscover igblast\fP subcommand and is the result of parsing raw output from IgBLAST.
It contains a few additional columns that do not come directly from IgBLAST.
In particular, the CDR3 sequence is detected, the sequence before the V gene match is split into \fIUTR\fP and \fIleader\fP, and
the RACE\-specific run of G nucleotides is also detected.
The first row is a header row with column names.
Each subsequent row describes the IgBLAST results for a single pre\-processed input sequence.
.sp
Note: This file is typically quite large.
LibreOffice can open the file directly (even though it is compressed), but make sure you have enough RAM.
.sp
Columns:
.INDENT 0.0
.TP
.B count
How many copies of the input sequence this query sequence represents. Copied from the size entry (such as \fB;size=3;\fP) in the FASTA
header field that is added by \fBVSEARCH \-derep_fulllength\fP\&.
.TP
.B V_gene, D_gene, J_gene
V/D/J gene match for the query sequence
.TP
.B stop
whether the sequence contains a stop codon (either “yes” or “no”)
.UNINDENT
.sp
productive
.INDENT 0.0
.TP
.B V_covered, D_covered, J_covered
percentage of bases of the reference gene that is covered by the bases of the query sequence
.TP
.B V_evalue, D_evalue, J_evalue
E\-value of V/D/J hit
.TP
.B FR1_SHM, CDR1_SHM, FR2_SHM, CDR2_SHM, FR3_SHM, V_SHM, J_SHM
rate of somatic hypermutation (actually, an error rate)
.TP
.B V_errors, J_errors
Absolute number of errors (differences) in the V and J gene match
.TP
.B UTR
Sequence of the 5\(aq UTR (the part before the V gene match up to, but not including, the start codon)
.TP
.B leader
Leader sequence (the part between UTR and the V gene match)
.TP
.B CDR1_nt, CDR1_aa, CDR2_nt, CDR2_aa, CDR3_nt, CDR3_aa
nucleotide and amino acid sequence of CDR1/2/3
.TP
.B V_nt, V_aa
Nucleotide and amino acid sequence of V gene match
.TP
.B V_CDR3_start
Start coordinate of CDR3 within \fBV_nt\fP\&. Set to zero if no CDR3 was detected.
Comparisons involving the V gene ignore those V bases that are part of the CDR3.
.TP
.B V_end, VD_junction, D_region, DJ_junction, J_start
nucleotide sequences for various match regions
.TP
.B name, barcode, race_G, genomic_sequence
see the following explanation
.UNINDENT
.sp
The UTR, leader, barcode, race_G and genomic_sequence columns are filled in the following way.
.INDENT 0.0
.IP 1. 3
Split the 5\(aq end barcode from the sequence (if barcode length is zero, this will be empty), put it in the \fBbarcode\fP column.
.IP 2. 3
Remove the initial run of G bases from the remaining sequence, put that in the \fBrace_G\fP column.
.IP 3. 3
The remainder is put into the \fBgenomic_sequence\fP column.
.IP 4. 3
If there is a V gene match, take the sequence \fIbefore\fP it and split it up in the following way. Search for the start codon and write the part before it into the \fBUTR\fP column. Write the part starting with the start codon into the \fBleader\fP column.
.UNINDENT
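.sp
The four steps above can be sketched in Python as follows. This is only an illustration, not the code used by IgDiscover: the function name \fBsplit_upstream\fP and its parameters are hypothetical, and taking the first \fBATG\fP as the start codon is an assumption, since the exact search rule is not described here.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def split_upstream(sequence, barcode_length, v_start=None):
    """Split a read into barcode, race_G, genomic_sequence, UTR and leader.

    v_start is the (hypothetical) 0-based start of the V gene match within
    genomic_sequence, or None if there is no V gene match.
    """
    # Step 1: the barcode at the 5-prime end (empty if barcode_length is zero)
    barcode = sequence[:barcode_length]
    rest = sequence[barcode_length:]
    # Step 2: the initial run of G bases
    genomic_sequence = rest.lstrip("G")
    race_g = rest[:len(rest) - len(genomic_sequence)]
    # Step 3: the remainder is already stored in genomic_sequence
    utr = leader = ""
    if v_start is not None:
        # Step 4: split the part before the V gene match at the start codon
        before_v = genomic_sequence[:v_start]
        atg = before_v.find("ATG")  # assumption: first ATG is the start codon
        if atg != -1:
            utr, leader = before_v[:atg], before_v[atg:]
    return {
        "barcode": barcode,
        "race_G": race_g,
        "genomic_sequence": genomic_sequence,
        "UTR": utr,
        "leader": leader,
    }
```
.ft P
.fi
.UNINDENT
.UNINDENT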
.SS filtered.tab.gz
.sp
This table is the same as the \fBassigned.tab.gz\fP table, except that rows containing low\-quality matches have been filtered out.
Rows fulfilling any of the following criteria are filtered:
.INDENT 0.0
.IP \(bu 2
The J gene was not assigned
.IP \(bu 2
A stop codon was found
.IP \(bu 2
The V gene coverage is less than 90%
.IP \(bu 2
The J gene coverage is less than 60%
.IP \(bu 2
The V gene E\-value is greater than 10\s-2\u\-3\d\s0
.UNINDENT
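.sp
As a rough sketch, the criteria above can be written as a Python predicate over one row of \fBassigned.tab.gz\fP\&. The function names are hypothetical, the row is assumed to be a dictionary of strings (as produced by \fBcsv.DictReader\fP), and treating an empty field as a missing value is an assumption. The thresholds are the ones listed above.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def _number(text, missing):
    # Assumption: an empty field means that the value is not available
    return float(text) if text else missing


def is_low_quality(row):
    """Return True if any of the listed filtering criteria applies."""
    return (
        row["J_gene"] == ""                                   # no J gene assigned
        or row["stop"] == "yes"                               # stop codon found
        or _number(row["V_covered"], 0.0) < 90                # V coverage < 90%
        or _number(row["J_covered"], 0.0) < 60                # J coverage < 60%
        or _number(row["V_evalue"], float("inf")) > 1e-3      # V E-value > 10^-3
    )
```
.ft P
.fi
.UNINDENT
.UNINDENT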
.SS candidates.tab
.sp
This table contains the candidates for novel V genes found by the \fBdiscover\fP subcommand.
Like the other files, it is a text file in tab\-separated values format, with the first row containing the column headings.
It can be opened directly in LibreOffice, for example.
.sp
Candidates are found by inspecting all the sequences assigned to a database gene, and clustering them in multiple ways.
The candidate sequences are found by computing a consensus from each found cluster.
.sp
Each row describes a single candidate, but possibly multiple clusters.
If there are multiple clusters from a single gene that lead to the same consensus sequence, then they get only one row.
The \fIcluster\fP column lists the source clusters for the given sequence.
Duplicate sequences can still occur when two different genes lead to identical consensus sequences.
(These duplicated sequences are merged by the germline filters.)
.sp
Below, we use the term \fIcluster set\fP to refer to all the sequences that are in any of the listed clusters.
.sp
Some clusters lead to ambiguous consensus sequences (those that include \fBN\fP bases).
These have already been filtered out.
.INDENT 0.0
.TP
.B name
The name of the candidate gene. See \fI\%novel gene names\fP\&.
.TP
.B source
The original database gene to which the sequences from this row were originally assigned.
All candidates coming from the same source gene are grouped together.
.TP
.B chain
Chain type: \fIVH\fP for heavy, \fIVK\fP for light chain kappa, \fIVL\fP for light chain lambda
.TP
.B cluster
From which type of cluster or clusters the consensus was computed.
If there are multiple clusters that give rise to the same consensus sequence, they are all listed here, separated by semicolon.
A cluster name such as \fB2\-4\fP is for a percentage difference window:
Such a cluster consists of all sequences assigned to the source gene that have a percentage difference to it between 2 and 4 percent.
.sp
A cluster name such as \fBcl3\fP describes a cluster generated through linkage cluster analysis.
The clusters are simply named \fBcl1\fP, \fBcl2\fP, \fBcl3\fP etc.
If any cluster number seems to be missing (such as when cl1 and cl3 occur, but not cl2), then this means that the cluster led to an ambiguous consensus sequence that has been filtered out.
Since the \fBcl\fP clusters are created from a random subsample of the data (in order to keep computation time down),
they are never larger than the size of the subsample (currently 1000).
.sp
The name \fBdb\fP represents a cluster that is identical to the database sequence.
If no actual cluster corresponding to the database sequence is found, but the database sequence is expressed, a \fBdb\fP cluster is inserted artificially in order to make sure that the sequence is not lost.
The cluster name \fBall\fP represents the set of all sequences assigned to the source gene.
This means that an unambiguous consensus could be computed from all the sequences.
Typically, this happens during later iterations when there are no more novel sequences among the sequences assigned to the database gene.
.TP
.B cluster_size
The number of sequences from which the consensus was computed.
Equivalently, the size of the cluster set (all clusters described in this row).
Sequences that are in multiple clusters at the same time are counted only once.
.TP
.B Js
The number of unique J genes associated with the sequences in the cluster set.
.sp
Consensus sequences are computed only from V gene sequences, but each V gene sequence is part of a full V/D/J sequence.
We therefore know for each V sequence which J gene it was found with.
This number says how many different J genes were found for all sequences that the consensus in this row was computed from.
.TP
.B CDR3s
The number of unique CDR3 sequences associated with the sequences in the cluster set.
See also the description for the \fIJs\fP column.
This number says how many different CDR3 sequences were found for all sequences that the consensus in this row was computed from.
.TP
.B exact
The number of exact occurrences of the consensus sequence among all sequences assigned to the
source gene, ignoring the 3\(aq junction region.
.sp
To clarify: While the consensus sequence is computed only from a subset of sequences assigned
to a source gene, \fIall\fP sequences assigned to the source gene are searched for exact occurrences
of that consensus sequence.
.sp
When comparing sequences, they are first truncated at the 3\(aq end by removing those (typically
8) bases that correspond to the CDR3 region.
.TP
.B barcodes_exact
How many unique barcode sequences were used by the sequences in the set of exact sequences
(described above).
.TP
.B Ds_exact
How many unique D genes were used by the sequences in the set of exact sequences (described
above). Only those D gene assignments are included in this count for which the number of errors
is zero, the E\-value is at most a given threshold, and for which the number of covered bases
is at least a given percentage.
.TP
.B Js_exact
How many unique J genes were used by the sequences in the set of exact sequences (described above).
.TP
.B CDR3s_exact
How many unique CDR3 sequences were used by the sequences in the set of exact sequences (described above).
.TP
.B clonotypes
The estimated number of clonotypes within the set of exact sequences (which is described above).
The value is computed by clustering the unique CDR3 sequences associated with all exact
occurrences, allowing up to six differences (mismatches, insertions, deletions) and then
counting the number of resulting clusters.
.TP
.B database_diff
The number of differences between the consensus sequence and the sequence of the source gene.
(Given as edit distance, that is insertion, deletion, mismatch count as one difference each.)
.TP
.B has_stop
Indicates whether the consensus sequence contains a stop codon.
.TP
.B looks_like_V
Whether the consensus sequence “looks like” a true V gene (1 if yes, 0 if no).
Currently, this checks whether the 5\(aq end of the sequence matches a known V gene motif.
.TP
.B CDR3_start
Where the CDR3 starts within the discovered V gene sequence. This uses the most common
CDR3 start location among the sequences from which this consensus is derived.
.TP
.B consensus
The consensus sequence itself.
.UNINDENT
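.sp
To make the naming scheme of the \fIcluster\fP column concrete, here is a small Python sketch that interprets such a field. The function name \fBdescribe_clusters\fP and the returned tuples are hypothetical. It assumes that names are separated by a plain semicolon and that only the four kinds of names described above occur.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def describe_clusters(cluster_field):
    """Interpret a cluster field such as "2-4;cl3;db"."""
    descriptions = []
    for name in cluster_field.split(";"):
        name = name.strip()
        if name in ("db", "all"):
            # Identical to the database sequence, or all assigned sequences
            descriptions.append((name, None))
        elif name.startswith("cl") and name[2:].isdigit():
            # Cluster from linkage cluster analysis, for example cl3
            descriptions.append(("linkage", int(name[2:])))
        else:
            # Percentage difference window, for example 2-4
            low, high = name.split("-")
            descriptions.append(("window", (float(low), float(high))))
    return descriptions
```
.ft P
.fi
.UNINDENT
.UNINDENT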
.sp
The \fBigdiscover discover\fP command can also be run by hand with other parameters, in which case additional columns may appear.
.INDENT 0.0
.TP
.B N_bases
Number of \fBN\fP bases in the consensus
.UNINDENT
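.sp
The \fBclonotypes\fP column described earlier is an estimate obtained by clustering CDR3 sequences. The following Python sketch shows one way to compute such a number. It assumes single\-linkage clustering, which is not stated in this documentation, and the function names are hypothetical. Only the limit of six differences is taken from the column description.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def edit_distance(s, t):
    """Levenshtein distance: mismatch, insertion, deletion cost 1 each."""
    previous = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        current = [i]
        for j, b in enumerate(t, 1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (a != b),  # match or mismatch
            ))
        previous = current
    return previous[-1]


def count_clonotypes(cdr3s, max_differences=6):
    """Count single-linkage clusters among the unique CDR3 sequences."""
    unique = sorted(set(cdr3s))
    parent = list(range(len(unique)))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(unique)):
        for j in range(i + 1, len(unique)):
            if edit_distance(unique[i], unique[j]) <= max_differences:
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(unique))})
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The pairwise comparison is quadratic in the number of unique CDR3 sequences, which is acceptable for a sketch but may be slow for large sets.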
.SS annotated_V_*.tab
.sp
The two files \fBannotated_V_germline.tab\fP and \fBannotated_V_pregermline.tab\fP are copies of the \fBcandidates.tab\fP file with two extra columns that show \fIwhy\fP a candidate was filtered in the germline and pre\-germline filtering steps. The two columns are:
.INDENT 0.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
\fBis_filtered\fP – This is a number that indicates how many filtering criteria apply to this candidate.
.IP \(bu 2
\fBwhy_filtered\fP – This is a semicolon\-separated list of filtering reasons.
.UNINDENT
.UNINDENT
.UNINDENT
.sp
The following values can occur in the \fBwhy_filtered\fP column:
.INDENT 0.0
.TP
.B too_low_dbdiff
The number of differences between this candidate and the database is lower than the required number.
.TP
.B too_many_N_bases
The candidate contains too many \fBN\fP nucleotide wildcard characters.
.TP
.B too_low_CDR3s_exact
The \fBCDR3s_exact\fP value for this candidate is lower than required.
.TP
.B too_high_CDR3_shared_ratio
The \fBCDR3_shared_ratio\fP is higher than the configured threshold.
.TP
.B too_low_Js_exact
The \fBJs_exact\fP value is lower than the configured threshold.
.TP
.B has_stop
The filter configuration disallows stop codons, but this candidate has one and is not whitelisted.
.TP
.B too_low_cluster_size
The \fBcluster_size\fP of this candidate is lower than the configured threshold, and the candidate is not whitelisted.
.TP
.B is_duplicate
A filtering criterion not listed above applies to this candidate. This covers all the filters that need to compare candidates to each other: cross\-mapping ratio, clonotype allele ratio, exact ratio, Ds_exact ratio.
.UNINDENT
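.sp
To get an overview of which criteria remove most candidates, the \fBwhy_filtered\fP column can be tallied. The Python sketch below is one possible way to do that. The function name is hypothetical, and the file is assumed to be a plain tab\-separated table with a header row, as described above.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import csv
from collections import Counter


def count_filter_reasons(lines):
    """Count how often each reason occurs in the why_filtered column.

    lines is any iterable of text lines, such as an open
    annotated_V_germline.tab file.
    """
    reasons = Counter()
    for row in csv.DictReader(lines, dialect=csv.excel_tab):
        # The column holds a semicolon-separated list; it may be empty
        reasons.update(r for r in row["why_filtered"].split(";") if r)
    return reasons


# Possible usage (the path is only an example):
# with open("iteration-01/annotated_V_germline.tab") as f:
#     print(count_filter_reasons(f).most_common())
```
.ft P
.fi
.UNINDENT
.UNINDENT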
.SS Names for discovered genes
.sp
Each gene discovered by IgDiscover gets a unique name such as “VH4.11_S1234”.
The “VH4.11” is the name of the database gene to which the novel
V gene was initially assigned. The number \fI1234\fP is derived from the nucleotide
sequence of the novel gene. That is, if you discover the same sequence in two
different runs of IgDiscover, or just in different iterations, the number will
be the same. This may help when manually inspecting results.
.sp
Be aware that you still need to check the sequence itself since even different
sequences can sometimes lead to the same number (a “hash collision”).
.sp
The \fB_S1234\fP suffixes do not accumulate.
Before IgDiscover adds the suffix in an iteration, it removes the suffix if it already exists.
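.sp
The naming rule can be illustrated with a short Python sketch. The hash function used here (SHA\-256 reduced to four digits) is an assumption chosen for illustration and is not necessarily what IgDiscover uses, so the numbers it produces will generally differ from those in real output. The function name \fBunique_name\fP and the four\-digit suffix format are likewise assumptions.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import hashlib
import re


def unique_name(name, sequence):
    """Return name with an _S suffix derived from the sequence."""
    # Remove an existing suffix first so that suffixes do not accumulate
    base = re.sub("_S[0-9]{4}$", "", name)
    # Assumption: any stable hash illustrates the idea; equal sequences
    # give equal numbers, and different sequences may collide
    digest = hashlib.sha256(sequence.encode("ascii")).hexdigest()
    return "{}_S{:04d}".format(base, int(digest, 16) % 10000)
```
.ft P
.fi
.UNINDENT
.UNINDENT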
.SS Subcommands
.sp
The \fBigdiscover\fP program has multiple subcommands.
You should already be familiar with the two commands \fBinit\fP and \fBrun\fP\&.
Each subcommand comes with its own help page that shows how to use that subcommand.
Run the command with the \fB\-\-help\fP option to see the help. For example,
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover run \-\-help
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
shows the help for the \fBrun\fP subcommand.
.sp
The following additional subcommands may be useful for further analysis.
.INDENT 0.0
.TP
.B commonv
Find common V genes between two different antibody libraries
.TP
.B upstream
Cluster upstream sequences (UTR and leader) for each gene
.TP
.B dendrogram
Draw a dendrogram of sequences in a FASTA file.
.TP
.B rename
Rename sequences in a target FASTA file using a template FASTA file
.TP
.B union
Compute union of sequences in multiple FASTA files
.UNINDENT
.sp
The following subcommands are used internally, and listed here for completeness.
.INDENT 0.0
.TP
.B filter
Filter a table with IgBLAST results
.TP
.B count
Count and plot V, D, J gene usage
.TP
.B group
Group sequences by barcode and V/J assignment and print each group’s consensus (unused in IgDiscover)
.TP
.B germlinefilter
Create new V gene database from V gene candidates using the germline and pre\-germline filter
criteria.
.TP
.B discover
Discover candidate new V genes within a single antibody library
.TP
.B clusterplot
For each V gene, plot a clustermap of the sequences assigned to it
.TP
.B errorplot
Plot histograms of differences to reference V gene
.UNINDENT
.SS Germline and pre\-germline filtering
.sp
V gene sequences found by the clustering step of the program (the \fBdiscover\fP subcommand) are
stored in the \fBcandidates.tab\fP file. The entries are “candidates” because many of these will be
PCR or other artifacts and therefore do not represent true novel V genes. The germline and
pre\-germline filters take care of removing artifacts. The germline filter is the “real” filter and
is used only in the last iteration in order to obtain the final gene database. The pre\-germline filter
is less strict and is used in all the earlier iterations.
.sp
The germline filters are implemented in the \fBigdiscover germlinefilter\fP subcommand. It performs the
following filtering and processing steps:
.INDENT 0.0
.IP \(bu 2
Discard sequences with \fBN\fP bases
.IP \(bu 2
Discard sequences that come from a consensus over too few source sequences
.IP \(bu 2
Discard sequences with too few unique CDR3s (CDR3s_exact column)
.IP \(bu 2
Discard sequences with too few unique Js (Js_exact column)
.IP \(bu 2
Discard sequences identical to one of the database sequences (if DB given)
.IP \(bu 2
Discard sequences that do not match a set of known good motifs
.IP \(bu 2
Discard sequences that contain a stop codon (has_stop column)
.IP \(bu 2
Discard near\-duplicate sequences
.IP \(bu 2
Discard cross\-mapping artifacts
.IP \(bu 2
Discard sequences whose “allele ratio” is too low.
.UNINDENT
.sp
If a whitelist of sequences is provided (by default, this is the input V gene database), then the
candidates that appear on it
.INDENT 0.0
.IP \(bu 2
are not checked for the cluster size criterion,
.IP \(bu 2
do not need to match a set of known good motifs,
.IP \(bu 2
are never considered near\-duplicates (but they are checked for
cross\-mapping and for the allele ratio),
.IP \(bu 2
are allowed to contain a stop codon.
.UNINDENT
.sp
Whitelisting allows IgDiscover to identify known germline sequences that are expressed at low
levels in a library. If enabled with \fBwhitelist: true\fP (the default) in the pregermline and
germline filter sections of the configuration file, the sequences present in the starting database
are treated as validated germline sequences and will not be discarded because of a too small cluster
size as long as they fulfill the remaining criteria (unique_cdr3s, unique_js etc.).
.sp
You can see why a candidate was filtered by inspecting the \fI\%annotated_V_*.tab files\fP\&.
.SS Cross\-mapping artifacts
.sp
If two very similar sequences appear in the database used by IgBLAST,
then sequencing errors may lead to one sequence incorrectly being assigned
to the other. This is particularly problematic if one of the sequences is
highly expressed while the other is not expressed at all. The unexpressed
sequence is still included in the list of V gene candidates because it is
in the input database and therefore whitelisted. We call this a “cross\-mapping
artifact”.
.sp
The germline filtering step of IgDiscover therefore aims to eliminate
cross\-mapping artifacts by checking all pairs of sequences for the following:
.INDENT 0.0
.IP \(bu 2
The two sequences have a distance of 1,
.IP \(bu 2
they are both in the database for that particular iteration (only then
can cross\-mapping occur), and
.IP \(bu 2
the ratio between the expression levels of the two sequences (using
the cluster_size field in the \fBcandidates.tab\fP file) is less than the value
\fBcross_mapping_ratio\fP defined in the configuration file (0.02 by default).
.UNINDENT
.sp
If all that is the case, then the sequence with the lower expression is
discarded.
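.sp
The pairwise check described above can be sketched in Python as follows. This is an illustration under several assumptions: “distance” is taken to mean edit distance, the ratio is computed as the lower \fBcluster_size\fP divided by the higher one, and the function names and dictionary keys are hypothetical. Only the default of 0.02 for \fBcross_mapping_ratio\fP comes from the text.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def distance_is_one(s, t):
    """Return True if s and t differ by exactly one substitution or indel."""
    if len(s) == len(t):
        return sum(a != b for a, b in zip(s, t)) == 1
    if abs(len(s) - len(t)) != 1:
        return False
    short, long_ = sorted((s, t), key=len)
    # Deleting one base from the longer sequence must give the shorter one
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def cross_mapping_loser(a, b, database, cross_mapping_ratio=0.02):
    """Return the candidate to discard, or None if both are kept.

    a and b are dicts with the keys "sequence" and "cluster_size";
    database is the set of sequences in the database of this iteration.
    """
    if not distance_is_one(a["sequence"], b["sequence"]):
        return None
    if a["sequence"] not in database or b["sequence"] not in database:
        return None
    low, high = sorted((a, b), key=lambda c: c["cluster_size"])
    if high["cluster_size"] == 0:
        return None
    if low["cluster_size"] / high["cluster_size"] < cross_mapping_ratio:
        return low  # the sequence with the lower expression is discarded
    return None
```
.ft P
.fi
.UNINDENT
.UNINDENT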
.SS Allele\-ratio filtering
.sp
When multiple alleles of the same gene appear in the list of V gene candidates,
such as IGHV1\-2*02 and IGHV1\-2*04, the germline filter computes the ratio
of the values in the \fBexact\fP and the \fBclonotypes\fP columns between them.
If the ratio is under the configured threshold, the candidate with the lower
count is discarded. See the \fBexact_ratio\fP and \fBclonotype_ratio\fP
settings in the \fBgermline_filter\fP and \fBpregermline_filter\fP sections
of the configuration file.
.sp
New in version 0.7.0.

.SS Data from the Sequence Read Archive (SRA)
.sp
To work with datasets from the Sequence Read Archive, you may want to use the
tool \fBfastq\-dump\fP, which can download the reads in the format required by
IgDiscover. You just need to know the accession number, such as “SRR2905710” and
then run this command to download the files to the current directory:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
fastq\-dump \-\-split\-files \-\-gzip SRR2905710
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The \fB\-\-split\-files\fP option ensures that the paired\-end reads are stored in two
separate files, one for the forward and one for the reverse read, respectively.
(If you do not provide it, you will get an interleaved FASTQ file that currently
cannot be read by IgDiscover.) The \fB\-\-gzip\fP option creates compressed output.
The command creates two files in the current directory. In the above example,
they would be named \fBSRR2905710_1.fastq.gz\fP and \fBSRR2905710_2.fastq.gz\fP\&.
.sp
The program \fBfastq\-dump\fP is part of the SRA toolkit. On Debian\-derived
Linux distributions, you can typically install it with \fBsudo apt\-get install
sra\-toolkit\fP\&. On Conda, install it with \fBconda install \-c bioconda sra\-tools\fP\&.
.SS Does random subsampling influence results?
.sp
Random subsampling indeed influences somewhat which sequences are found by the cluster analysis,
particularly in the beginning. However, the probability is large that all highly expressed
sequences are represented in the random sample. Also, due to the database growing with subsequent
iterations, the set of sequences assigned to a single database gene becomes smaller and more
homogeneous. This makes it increasingly likely that sequences expressed at lower levels also
result in a cluster since they now make up a larger fraction of each subsample.
.sp
Also, many of the clusters which are captured in one subsample but not in the other are artifacts
that are then filtered out anyway by the pre\-germline or germline filter.
.sp
On human data with a nearly complete starting database, the subsampling seems to have no influence
at all, as we determined experimentally. We repeated a run of the program four
times on the same human dataset, using identical parameters each time except that the subsampling
was done in a different way. Although intermediate results differed, all four personalized
databases that the program produced were exactly identical.
.sp
Concordance is lower, though, when the input database is not as complete as the human one.
.sp
The way in which random subsampling is done is modified by the \fBseed\fP configuration setting,
which is set to 1 by default. If its value is the same for two different runs of the program with
otherwise identical settings, the numbers chosen by the random number generator will be the same
and therefore also subsampling will be done in an identical way. This makes runs of the program
reproducible. In order to test how results differ when subsampling is done in a different way,
change the \fBseed\fP to a different value.
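.sp
The effect of a fixed seed can be demonstrated with a few lines of Python. This is a generic sketch of seeded subsampling, not the code IgDiscover runs: the function name \fBsubsample\fP is hypothetical, and the use of the \fBrandom\fP module from the Python standard library is an assumption made for illustration.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
import random


def subsample(sequences, size, seed=1):
    """Draw a reproducible random subsample of at most size sequences."""
    if len(sequences) <= size:
        return list(sequences)
    # A private generator: the same seed always gives the same choices
    rng = random.Random(seed)
    return rng.sample(sequences, size)
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
Calling the function twice with the same seed returns the same subsample, while a different seed will in general select different sequences.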
.SS Logging the program’s output to a file
.sp
When you report a bug or unusual behavior to us, we might ask you to send us the output of
\fBigdiscover run\fP\&. You can send its output to a file by running the program like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover run >& logfile.txt
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
And here is how to send the logging output to a file \fIand\fP also see the output in your terminal
at the same time (but you lose the colors):
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
igdiscover run |& tee logfile.txt
.ft P
.fi
.UNINDENT
.UNINDENT
.SS Caching of IgBLAST results and of merged reads
.sp
Sometimes you may want to re\-analyze a dataset multiple times with different filter settings.
To speed this up, IgDiscover can cache the results of two of the most time\-consuming
steps, read\-merging with PEAR and running IgBLAST.
.sp
The cache is disabled by default as it uses a lot of disk space. To enable the cache, create
a file named \fB~/.config/igdiscover.conf\fP with the following contents:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
use_cache: true
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If you do so, a directory named \fB~/.cache/igdiscover/\fP is created the next time you run
IgDiscover and all IgBLAST results as well as merged reads from PEAR are stored there. On
subsequent runs, the existing result is used directly without calling the respective
program, which speeds up the pipeline considerably.
.sp
The cache is only used when we are certain that the results will indeed be the same. For
example, if the IgBLAST program version or the V/D/J database changes, the cached result
is not used.
.sp
The files in the cache are compressed, but the cache may still get large over time. You can
delete the cache with \fBrm \-r ~/.cache/igdiscover\fP to free the space.
.sp
You should also delete the cache when updating to a newer IgBLAST version as the old results
will not be used anymore.
.SS Terms
.INDENT 0.0
.TP
.B Analysis directory
The directory that was created with \fBigdiscover init\fP\&. Separate ones are created for
each experiment. If you ran \fBigdiscover init myexperiment\fP, the analysis directory
is \fBmyexperiment/\fP\&.
.TP
.B Starting database
The initial list of V/D/J genes. These are expected to be in FASTA format and are copied into
the \fBdatabase/\fP directory within each analysis directory.
.UNINDENT
.SH QUESTIONS AND ANSWERS
.SS How many sequences are needed to discover germline V gene sequences?
.sp
Library sizes of several hundred thousand sequences are required for V gene discovery, with even
higher numbers necessary for full database production. For example, this means IgM library sizes of 750,000
to 1,000,000 sequences for heavy chain databases and 1.5 to 2 million sequences for light chain
databases.
.SS Can IgDiscover analyze IgG libraries?
.sp
IgDiscover has been developed to identify germline databases from libraries that contain
substantial fractions of unswitched antibody sequences. We recommend IgM libraries for heavy
chain V gene identification and IgKappa and IgLambda libraries for light chain identification.
IgDiscover can identify a proportion of germline sequences in IgG libraries but the process is
much more efficient with IgM libraries, enabling the full set of germline sequences to be
discovered.
.SS Can IgDiscover analyze a previously sequenced library?
.sp
Yes, IgDiscover accepts both unpaired FASTQ files and paired FASTA files but the program should
be made aware which is being used, see input requirements\&.
.SS Do the positions of the PCR primers make a difference to the output?
.sp
Yes. For accurate V gene discovery, all primer sequences must be external to the V gene sequences.
For example, forward multiplex amplification primers should be present in the leader sequence or
5\(aq UTR, and reverse amplification primers should be located in the constant region, preferably
close to the 5\(aq border of the constant region. Primers that are present in the framework 1 region or
J segments are not recommended for library production.
.SS What are the advantages to 5\(aq\-RACE compared to multiplex PCR for IgDiscover analysis?
.sp
Both 5\(aq\-RACE and multiplex PCR have their own advantages.
.sp
5\(aq\-RACE will enable library production from species where the upstream V gene sequence is unknown.
The output of the \fBupstream\fP subcommand in IgDiscover enables the identification of consensus
leader and 5\(aq\-UTR sequences for each of the identified germline V genes, which can subsequently
be used for primer design for either multiplex PCR or for monoclonal antibody amplification sets.
.sp
Multiplex PCR is recommended for species where the upstream sequences are well characterized.
Multiplex amplification products are shorter than 5\(aq\-RACE products and are therefore easier
to pair and have fewer length\-associated sequence errors.
.SS What is meant by \(aqstarting database\(aq?
.sp
The starting database refers to the folder that contains the three FASTA files necessary for the
process of iterative V gene discovery to begin. IgDiscover uses the standalone IgBLAST program for
comparative assignment of sequences to the starting database. Because IgBLAST requires three
files (for example \fBV.fasta\fP, \fBD.fasta\fP, \fBJ.fasta\fP), three FASTA files should be included
in the database folder for each analysis to proceed.
.sp
In the case of light chains (that do not contain D segments), a dummy D segment file should be
included as IgBLAST will not proceed if it does not see three files in the database folder. It is
sufficient to save the following sequence as a FASTA file named \fBD.fasta\fP for it to function
as the dummy D segment file, for example for human light chain analysis:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
>D_ummy
GGGGGGGGGG
.ft P
.fi
.UNINDENT
.UNINDENT
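.sp
As a minimal sketch, assuming a POSIX shell and a starting database folder named
\fBdatabase/\fP (a placeholder name, not one that IgDiscover requires), the dummy file
could be created like this:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# "database" is a placeholder; use your own starting database folder
printf \(aq>D_ummy\enGGGGGGGGGG\en\(aq > database/D.fasta
.ft P
.fi
.UNINDENT
.UNINDENT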
.SS How can I use the IMGT database as a starting database?
.sp
Since we do not have permission to distribute IMGT database files with IgDiscover, you need to
download them directly from \fI\%IMGT\fP\&.
See the section about obtaining a V/D/J database\&.
.SS How do I change the parameters of the program?
.sp
By editing the configuration file\&.
.SS Where do I find the individualized database produced by IgDiscover?
.sp
The final germline database in FASTA format is in your analysis
directory in the subdirectory \fBfinal/database/\fP\&. The \fBV.fasta\fP file
contains the new list of V genes. The \fBD.fasta\fP and \fBJ.fasta\fP files are unchanged from the
starting database.
.sp
A phylogenetic tree of the V sequences can be found in \fBfinal/dendrogram_V.pdf\fP\&.
.sp
For more details of how that database was created, you need to inspect the files created in the last
iteration of the discovery process, located in \fBiteration\-xx\fP, where \fBxx\fP is the number of
iterations configured in the \fBigdiscover.yaml\fP configuration file. For example, if three
iterations were used, look into \fBiteration\-03/\fP\&.
.sp
Most interesting in that folder are likely
.INDENT 0.0
.IP \(bu 2
the linkage cluster analysis plots in \fBiteration\-03/clusterplots/\fP,
.IP \(bu 2
the error histograms in \fBiteration\-03/errorhistograms.pdf\fP, which contain the windowed cluster
analysis figures, and
.IP \(bu 2
details about the individualized database in \fBnew_V_germline.tab\fP in tab\-separated\-value format.
.UNINDENT
.sp
The \fBnew_V_germline.fasta\fP file is identical to the one in \fBfinal/database/V.fasta\fP\&.
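.sp
To check quickly how many V sequences the final database contains, one option is to count
the FASTA header lines. This is only a sketch and assumes it is run from within the analysis
directory:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# Count the lines containing ">", that is, the FASTA header lines
grep \-c \(aq>\(aq final/database/V.fasta
.ft P
.fi
.UNINDENT
.UNINDENT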
.SS What does the _S1234 at the end of some gene names mean?
.sp
Please see the Section on gene names\&.
.SH ADVANCED TOPICS
.sp
IgDiscover itself does not (yet) come with all imaginable analysis facilities built into it.
However, it creates many files (mostly with tables) that can be used for custom analysis.
For example, all \fB\&.tab\fP files (in particular \fBassigned.tab.gz\fP and \fBcandidates.tab\fP)
can be opened and inspected in a spreadsheet application such as LibreOffice. From there,
you can do basic tasks such as sorting from the menu of that application.
.sp
Often, these facilities are not enough, however, and some basic understanding of the
command\-line is helpful. Clearly, this is not as convenient as working in a graphical
user interface (GUI), but we do not currently have the resources to provide one for
IgDiscover. To alleviate this somewhat, we provide here instructions for a few things
that you may want to do with the IgDiscover result files.
.SS Extract all sequences that match any database gene exactly
.sp
The \fBcandidates.tab\fP file tells you for each discovered sequence how often an \fIexact match\fP
of that sequence was found in your input reads. A high number of exact matches is a good
indication that the candidate is actually a new gene or allele. In order to find the original
reads that correspond to those matches, you can look at the \fBfiltered.tab.gz\fP file and
extract all rows where the \fBV_errors\fP column is zero.
.sp
First, run this on the filtered.tab.gz file:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
zcat filtered.tab.gz | head \-n 1 | tr \(aq\et\(aq \(aq\en\(aq | nl
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
This will enumerate the columns in the file. Take a note of the index
that the V_errors column has. In newer pipeline versions, the index is
21. Then extract all rows of the file where that field is equal to zero:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
zcat filtered.tab.gz | awk \-vFS="\et" \(aq$21 == 0 || NR == 1\(aq > exact.tab
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
If the column wasn’t 21, then replace the \fB$21\fP appropriately. The part
where it says \fBNR == 1\fP ensures that the column headings are also printed.
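.sp
To avoid looking up the column number by hand, the lookup and the filtering can be combined
in a single command. The following is a sketch that assumes only that the header line contains
a column named exactly \fBV_errors\fP; if no such column exists, only the header is written:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# Find the V_errors column in the header line, print the header,
# then print every row in which that column is zero
zcat filtered.tab.gz | awk \-F\(aq\et\(aq \(aq
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "V_errors") col = i; print; next }
    col && $col == 0\(aq > exact.tab
.ft P
.fi
.UNINDENT
.UNINDENT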
.SS Extra configuration settings
.sp
Some configuration settings are not documented in the default \fBigdiscover.yaml\fP file
since they rarely need to be changed.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# Leave empty or choose a species name supported by IgBLAST:
# human, mouse, rabbit, rat, rhesus_monkey
# This setting is not used anywhere except that it is passed
# to IgBLAST using the \-organism option. Since we provide IgBLAST
# with our own gene databases, it seems this has no effect.
species:
.ft P
.fi
.UNINDENT
.UNINDENT
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# Which program to use for computing multiple alignments. This is used for
# computing consensus sequences.
# Choose \(aqmafft\(aq, \(aqclustalo\(aq, \(aqmuscle\(aq or \(aqmuscle\-fast\(aq.
# \(aqmuscle\-fast\(aq runs muscle with parameters "\-maxiters 1 \-diags".
#
#multialign_program: muscle\-fast
.ft P
.fi
.UNINDENT
.UNINDENT
.SH DEVELOPMENT
.INDENT 0.0
.IP \(bu 2
\fI\%Source code\fP
.IP \(bu 2
\fI\%Report an issue\fP
.UNINDENT
.SS Installing the development version
.sp
To use the most recent IgDiscover version from Git, follow these steps.
.INDENT 0.0
.IP 1. 3
If you haven’t done so, install miniconda. See the first steps of the
regular installation instructions\&. Do not install
IgDiscover, yet!
.IP 2. 3
Clone the IgDiscover repository:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
git clone https://github.com/NBISweden/IgDiscover.git
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
(Use the git@ URL instead if you are a developer.)
.UNINDENT
.INDENT 0.0
.IP 3. 3
Create a new Conda environment using the \fBenvironment.yml\fP file in the
repository:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
cd IgDiscover
conda env create \-n igdiscover \-f environment.yml
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
You can choose a different environment name by changing the name after the
\fB\-n\fP parameter. This may be necessary when you already have a regular
(non\-developer) IgDiscover installation in an \fBigdiscover\fP environment
that you don’t want to overwrite.
.IP 4. 3
Activate the environment:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
source activate igdiscover
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
(Or use whichever name you chose above.)
.IP 5. 3
Install IgDiscover in “editable” mode:
.INDENT 3.0
.INDENT 3.5
.sp
.nf
.ft C
python3 \-m pip install \-e .
.ft P
.fi
.UNINDENT
.UNINDENT
.UNINDENT
.sp
Whenever you want to update the software:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
cd IgDiscover
git pull
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
It may also be necessary to repeat the \fBpython3 \-m pip install \-e .\fP step.
.SS IgBLAST result cache
.sp
For development, in particular when running tests repeatedly, you should enable the IgBLAST
result cache. The cache stores IgBLAST output. If the same dataset is run a second time with
the same settings, the result is retrieved from the cache and IgBLAST is not re\-run. This saves a lot
of time when re\-running datasets, but may also fill up the cache directory \fB~/.cache/igdiscover/\fP\&.
Also, in production, datasets are usually not re\-run with the same settings, which is why
caching is disabled by default.
.sp
To enable the cache, create a file \fB~/.config/igdiscover.conf\fP with the following content:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
use_cache: true
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The file is in YAML format, but at the moment, no other settings are supported.
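.sp
From a shell, the file could be created as follows. This is a sketch: it overwrites any existing
\fB~/.config/igdiscover.conf\fP, and the \fBdu\fP line is merely one way to check how much
space the cache occupies (it reports an error if the cache directory does not exist yet):
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
# Create the configuration directory if needed, then write the setting
mkdir \-p ~/.config
printf \(aquse_cache: true\en\(aq > ~/.config/igdiscover.conf

# Show the disk space currently used by the cache
du \-sh ~/.cache/igdiscover/
.ft P
.fi
.UNINDENT
.UNINDENT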
.SS Building the documentation
.sp
Go to the \fBdoc/\fP directory in the repository, then run:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
make
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
to build the documentation locally. Open \fB_build/html/index.html\fP in
a browser. The layout is different from the \fI\%version shown on
Read the Docs\fP, but allows you to
preview any changes you may have made.
.SS Making a release
.sp
We use \fI\%versioneer\fP to
manage version numbers. It extracts the version number from the
most recent tag in Git. Thus, to increment the version number, create
a Git tag:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
git tag v0.5
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The \fBv\fP prefix is mandatory.
.sp
Then:
.INDENT 0.0
.IP \(bu 2
\fBtests/run.sh\fP
.IP \(bu 2
\fBpython3 setup.py sdist\fP
.IP \(bu 2
\fBtwine upload dist/igdiscover\-0.10.tar.gz\fP
.IP \(bu 2
Update bioconda recipe
.UNINDENT
.SS Removing IgDiscover from a Linux system
.sp
If you have been playing around with different installation methods (\fBpip\fP,
\fBconda\fP, \fBgit\fP, \fBpython3 setup.py install\fP etc.) you may have multiple
copies of IgDiscover on your system and you will likely run into problems
on updates. Here is a list you can follow in order to get rid of the
installations as preparation for a clean re\-install. \fIDo not\fP add \fBsudo\fP to
the commands below if you get permission problems, unless explicitly told to do
so! If one of the steps does not work, that is fine; just continue.
.INDENT 0.0
.IP 1. 3
Delete miniconda: Run the command \fBwhich conda\fP\&. The output will be
something like \fB/home/myusername/miniconda3/bin/conda\fP\&. The part before
\fBbin/conda\fP is the miniconda installation directory. Delete that folder. In
this case, you would need to delete \fBminiconda3\fP in \fB/home/myusername\fP\&.
.IP 2. 3
Run \fBpip3 uninstall igdiscover\fP\&. If this runs successfully and prints some
messages about removing files, then \fIrepeat the same command\fP! Do this
until you get a message telling you that the package cannot be uninstalled
because it is not installed.
.IP 3. 3
Repeat the previous step, but with \fBpip3 uninstall sqt\fP\&.
.IP 4. 3
If you have a directory named \fB\&.local\fP within your home directory, you may
want to rename it: \fBmv .local dot\-local\-backup\fP\&. You can also delete it, but
there is a small risk that other software (not IgDiscover) uses that
directory. The directory is hidden, so a normal \fBls\fP will not show it.
Use \fBls \-la\fP while in your home directory to see it.
.IP 5. 3
If you have ever used \fBsudo\fP to install IgDiscover, you may have an
installation in \fB/usr/local/\fP\&. You can try to remove it with
\fBsudo pip3 uninstall igdiscover\fP\&.
.IP 6. 3
Delete the cloned Git repository if you have one. This is the directory in
which you run \fBgit pull\fP\&.
.UNINDENT
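.sp
Steps 2 and 3 can be scripted. The loop below is a sketch rather than an official procedure:
it assumes that \fBpip3 show\fP exits with a non\-zero status once a package is no longer
installed, which is how current pip versions behave.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
for pkg in igdiscover sqt; do
    # Uninstall repeatedly until pip no longer finds the package
    while pip3 show "$pkg" > /dev/null 2>&1; do
        pip3 uninstall \-y "$pkg" || break
    done
done
.ft P
.fi
.UNINDENT
.UNINDENT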
.sp
Finally, you can follow the normal installation instructions and then the
developer installation instructions.
.SH CHANGES
.SS v0.11 (2018\-11\-27)
.INDENT 0.0
.IP \(bu 2
The IgBLAST cache is now disabled by default. We assume that, in most cases,
datasets will not be re\-run with the exact same parameters, and then it only
fills up the disk. Delete your cache with \fBrm \-r ~/.cache/igdiscover\fP to
reclaim the space. To enable the cache, create a file
\fB~/.config/igdiscover.conf\fP with the contents \fBuse_cache: true\fP\&.
.IP \(bu 2
If you choose to enable the cache, results from the PEAR merging step will
now also be cached. See also the caching documentation\&.
.IP \(bu 2
Added detection of chimeras to the (pre\-)germline filters. Any novel allele
that can be explained as a chimera of two unmodified reference alleles is
marked in the \fBnew_V_germline.tab\fP file. This is a bit sensitive, so the
candidate is currently not discarded.
.IP \(bu 2
Two additional files \fBannotated_V_germline.tab\fP and
\fBannotated_V_pregermline.tab\fP are created in each iteration during the
germline filtering step. These are identical to the \fBcandidates.tab\fP
file, except that they contain a \fBwhy_filtered\fP column that describes
why a sequence was filtered. See the documentation for this feature\&.
.IP \(bu 2
A more realistic test dataset (v0.5), now based on human instead of rhesus
data, was prepared. The testing instructions have been
updated accordingly.
.IP \(bu 2
J discovery has been tuned to give fewer truncated sequences.
.IP \(bu 2
Statistics are written to \fBstats/stats.json\fP\&.
.IP \(bu 2
V SHM distribution plots are created automatically and written to
\fBv\-shm\-distributions.pdf\fP in each iteration folder.
.IP \(bu 2
An \fBigdiscover dbdiff\fP subcommand was added that can compare two FASTA
files.
.UNINDENT
.SS v0.10 (2018\-05\-11)
.INDENT 0.0
.IP \(bu 2
When computing a consensus sequence, allow some sequences to be truncated in
the 3\(aq end. Many of the discovered novel V alleles were truncated by one
nucleotide in the 3\(aq end because IgBLAST does not always extend the
alignment to the end of the V sequence. If these slightly too short V
sequences were in the majority, their consensus would lead to a truncated
sequence as well. The new consensus algorithm allows for this effect at the
3\(aq end and can therefore more often than previously find the full sequence.
Example:
.INDENT 2.0
.INDENT 3.5
.sp
.nf
.ft C
TACTGTGCGAGAGA (seq 1)
TACTGTGCGAGAGA (seq 2)
TACTGTGCGAGAG\- (seq 3)
TACTGTGCGAG\-\-\- (seq 4)
TACTGTGCGAG\-\-\- (seq 5)

TACTGTGCGAGAG  (previous consensus)
TACTGTGCGAGAGA (new consensus)
.ft P
.fi
.UNINDENT
.UNINDENT
.IP \(bu 2
Add a column \fBdatabase_changes\fP to the \fBnew_V_germline.tab\fP file that
describes how the novel sequence differs from the database sequence. Example:
\fB93C>T; 114A>G\fP
.IP \(bu 2
Allow filtering by \fBCDR3_shared_ratio\fP and do so by default (needs
documentation)
.IP \(bu 2
Cache the edit distance when computing the distance matrix. Speeds up the
\fBdiscover\fP command slightly.
.IP \(bu 2
\fBdiscover\fP: Use more than six CPU cores if available
.IP \(bu 2
\fBigblast\fP: Print progress every minute
.UNINDENT
.SS v0.9 (2018\-03\-22)
.INDENT 0.0
.IP \(bu 2
Implemented allele ratio filtering for J gene discovery
.IP \(bu 2
J genes are discovered as part of the pipeline (previously, one needed
to run the \fBdiscoverj\fP script manually)
.IP \(bu 2
In each iteration, dendrograms are now created not only for V genes, but
also for D and J genes. The file names are \fBdendrogram_D.pdf\fP,
\fBdendrogram_J.pdf\fP
.IP \(bu 2
The V dendrograms are now in \fBdendrogram_V.pdf\fP (no longer
\fBV_dendrogram.pdf\fP). This puts all the dendrograms together when looking
at the files in the iteration directory.
.IP \(bu 2
The \fBV_usage.tab\fP and \fBV_usage.pdf\fP files are no longer created.
Instead, \fBexpressed_V.tab\fP and \fBexpressed_V.pdf\fP are created. These
contain similar information, but an allele\-ratio filter is used to
filter out artifacts.
.IP \(bu 2
Similarly, \fBexpressed_D.tab\fP and \fBexpressed_J.tab\fP and their
\fB\&.pdf\fP counterparts are created in each iteration.
.IP \(bu 2
Removed \fBparse\fP subcommand (functionality is in the \fBigblast\fP subcommand)
.IP \(bu 2
New CDR3 detection method (only heavy chain sequences): CDR3 start/end coordinates
are pre\-computed using the database V and J sequences. Increases detection rate
to 99% (previously less than 90%).
.IP \(bu 2
Remove the ability to check discovered genes for required motifs. This has never
worked well.
.IP \(bu 2
Add a column \fBclonotypes\fP to the \fBcandidates.tab\fP that tries to count how many
clonotypes are associated with a single candidate (using only exact occurrences).
This is intended to replace the \fBCDR3s_exact\fP column.
.IP \(bu 2
Add an \fBexact_ratio\fP to the germline filtering options. This checks the ratio
between the exact V occurrence counts (\fBexact\fP column) between alleles.
.IP \(bu 2
Germline filtering option \fBallele_ratio\fP was renamed to \fBclonotypes_ratio\fP
.IP \(bu 2
Implement a cache for IgBLAST results. When the same dataset is re\-analyzed,
possibly with different parameters, the cached results are used instead of
re\-running IgBLAST, which saves a lot of time. If the V/D/J database or the
IgBLAST version has changed, results are not re\-used.
.UNINDENT
.SS v0.8.0 (2017\-06\-20)
.INDENT 0.0
.IP \(bu 2
Add a \fBbarcodes_exact\fP column to the candidates table. It gives the number
of unique barcode sequences that were used by the sequences in the set of
exact sequences. Also, add a configuration setting \fBbarcode_consensus\fP
that can turn off consensus taking of barcode groups, which needs to be
set to \fBfalse\fP for \fBbarcodes_exact\fP to work.
.IP \(bu 2
Add a \fBDs_exact\fP column to candidates table.
.IP \(bu 2
Add a \fBD_coverage\fP configuration option.
.IP \(bu 2
The pre\-processing filtering step no longer reads in the full table of
IgBLAST assignments, but filters the table piece by piece. Memory usage
for this step therefore does not depend anymore on the dataset size and
should always be below 1 GB.
.IP \(bu 2
The functionality of the \fBparse\fP subcommand has been integrated into
the \fBigblast\fP subcommand. This means that \fBigdiscover igblast\fP now
directly outputs a result table (\fBassigned.tab\fP). This makes it easier
to use that subcommand directly instead of only via the workflow.
.IP \(bu 2
The \fBigblast\fP subcommand now always runs \fBmakeblastdb\fP by itself
and deletes the BLAST database afterwards. This reduces clutter and
ensures the database is always up to date.
.IP \(bu 2
Remove the \fBlibrary_name\fP configuration setting. Instead, the
\fBlibrary_name\fP is now always the same as the name of analysis
directory.
.UNINDENT
.SS v0.7.0 (2017\-05\-04)
.INDENT 0.0
.IP \(bu 2
Add an “allele ratio” criterion to the germline filter to further reduce
the number of false positives. The filter is activated by default and can
be configured through the \fBallele_ratio\fP setting in the configuration
file. See the documentation for how it works\&.
.IP \(bu 2
Ignore the CDR3\-encoding bases whenever comparing two V gene sequences.
.IP \(bu 2
Avoid finding 5\(aq\-truncated V genes by extending found hits towards the
5\(aq end.
.IP \(bu 2
By default, candidate sequences are no longer merged if they are nearly
identical. That is, the \fBdifferences\fP setting within the two germline
filter configuration sections is now set to zero by default.
Previously, we believed the merging would remove some false
positives, but it turns out we also miss true positives. It also seems
that with the other changes in this version we also no longer get the
particular false positives the setting was supposed to catch.
.IP \(bu 2
Implement an experimental \fBdiscoverj\fP script for J gene discovery.
It is currently not run automatically as part of \fBigdiscover run\fP\&. See
\fBigdiscover discoverj \-\-help\fP for how to run it manually.
.IP \(bu 2
Add a \fBconfig\fP subcommand, which can be used to change the
configuration file from the command\-line.
.IP \(bu 2
Add a \fBV_CDR3_start\fP column to the \fBassigned.tab\fP/\fBfiltered.tab\fP
tables. It describes where the CDR3 starts within the V sequence.
.IP \(bu 2
Similarly, add a \fBCDR3_start\fP column to the \fBnew_V_germline.tab\fP
file describing where the CDR3 starts within a discovered V sequence.
It is computed by using the most common CDR3 start of the
sequences within the cluster.
.IP \(bu 2
Rename the \fBcompose\fP subcommand to \fBgermlinefilter\fP\&.
.IP \(bu 2
The \fBinit\fP subcommand automatically fixes certain problems in the
input database (duplicate sequences, empty records, duplicate sequence
names). Previously, it would complain, but the user would have to fix
the problems themselves.
.IP \(bu 2
Move source code to GitHub
.IP \(bu 2
Set up automatic code testing (continuous integration) via Travis
.IP \(bu 2
Many documentation improvements
.UNINDENT
.SS v0.6.0 (2016\-12\-07)
.INDENT 0.0
.IP \(bu 2
The FASTA files of the input V/D/J gene lists now need to be
named \fBV.fasta\fP, \fBD.fasta\fP and \fBJ.fasta\fP\&. The species name
is no longer part of the file name. This should reduce confusion
when working with species not supported by IgBLAST.
.IP \(bu 2
The \fBspecies:\fP configuration setting in the configuration can
(and should) now be left empty. Its only use was that it is passed
to IgBLAST, but since IgDiscover provides IgBLAST with its own
V/D/J sequences anyway, it does not seem to make a difference.
.IP \(bu 2
A “cross\-mapping” detection has been added, which should reduce
the number of false positives.
See the documentation for an explanation\&.
.IP \(bu 2
Novel sequences identical to a database sequence no longer get the
\fB_S1234\fP suffix.
.IP \(bu 2
No longer trim the initial \fBG\fP run in sequences (due to RACE) by
default. It is now a configuration setting.
.IP \(bu 2
Add \fBcdr3_location\fP configuration setting: It allows setting whether to
use a CDR3 in addition to the barcode for grouping sequences.
.IP \(bu 2
Create a \fBgroups.tab.gz\fP file by default (describing the de\-barcoded
groups)
.IP \(bu 2
The pre\-processing filter is now configurable. See the
\fBpreprocessing_filter\fP section in the configuration file.
.IP \(bu 2
Many improvements to the documentation
.IP \(bu 2
Extended and fixed unit tests. These are now run via a CI system.
.IP \(bu 2
Statistics in JSON format are written to \fBstats/stats.json\fP\&.
.IP \(bu 2
IgBLAST 1.5.0 output can now be parsed. Parsing is also faster by 25%.
.IP \(bu 2
More helpful warning message when no sequences were discovered in
an iteration.
.IP \(bu 2
Drop support for Python 3.3.
.UNINDENT
.SS v0.5 (2016\-09\-01)
.INDENT 0.0
.IP \(bu 2
V sequences of the input database are now whitelisted by default.
The meaning of the \fBwhitelist\fP configuration option has changed:
If set to \fBfalse\fP, those sequences are no longer whitelisted.
To whitelist additional sequences, create a \fBwhitelist.fasta\fP
file as before.
.IP \(bu 2
Sequences with stop codons are now filtered out by default.
.IP \(bu 2
Use more stringent germline filtering parameters by default.
.UNINDENT
.SS v0.4 (2016\-08\-24)
.INDENT 0.0
.IP \(bu 2
It is now possible to install and run IgDiscover on OS X. Appropriate Conda
packages are available on bioconda.
.IP \(bu 2
Add column \fBhas_stop\fP to \fBcandidates.tab\fP, which indicates whether the
candidate sequence contains a stop codon.
.IP \(bu 2
Add a configuration option that makes it possible to disable the 5\(aq motif
check by setting \fBcheck_motifs: false\fP (the \fBlooks_like_V\fP column is
ignored in this case).
.IP \(bu 2
Make it possible to whitelist known sequences: If a found gene candidate
appears in that list, the sequence is included in the list of discovered
sequences even when it would otherwise not pass filtering criteria. To enable
this, just add a \fBwhitelist.fasta\fP file to the project directory before
starting the analysis.
.IP \(bu 2
The criteria for germline filter and pre\-germline filter are now configurable:
See \fBgermline_filter\fP and \fBpre_germline_filter\fP sections in the
configuration file.
.IP \(bu 2
Different runs of IgDiscover with the same parameters on the same input files
will now give the same results. See the \fBseed\fP parameter in the configuration, which also
describes how to get non\-reproducible results as before.
.IP \(bu 2
Both the germline and pre\-germline filter are now applied in each iteration.
Instead of the \fBnew_V_database.fasta\fP file, two files named
\fBnew_V_germline.fasta\fP and \fBnew_V_pregermline.fasta\fP are created.
.IP \(bu 2
The \fBcompose\fP subcommand now outputs a filtered version of the
\fBcandidates.tab\fP file in addition to a FASTA file. The table
contains columns \fBclosest_whitelist\fP, which is the name of the closest
whitelist sequence, and \fBwhitelist_diff\fP, which is the number of differences
to that whitelist sequence.
.UNINDENT
.SS v0.3 (2016\-08\-08)
.INDENT 0.0
.IP \(bu 2
Optionally, sequences are not renamed in the \fBassigned.tab\fP file, but
retain their original name as in the FASTA or FASTQ file. Set \fBrename:
false\fP in the configuration file to get this behavior.
.IP \(bu 2
Started an “advanced” section in the manual.
.UNINDENT
.SS v0.2
.INDENT 0.0
.IP \(bu 2
IgDiscover can now also detect kappa and lambda light chain V genes (VK, VL)
.UNINDENT
.SH AUTHOR
Marcel Martin
.SH COPYRIGHT
2015-2019, Marcel Martin
.\" Generated by docutils manpage writer.
.