.\" Man page generated from reStructuredText. . .TH "BCBIO_NEXTGEN" "1" "Jan 23, 2019" "1.1.2" "bcbio-nextgen" .SH NAME bcbio_nextgen \- bcbio_nextgen Documentation . .nr rst2man-indent-level 0 . .de1 rstReportMargin \\$1 \\n[an-margin] level \\n[rst2man-indent-level] level margin: \\n[rst2man-indent\\n[rst2man-indent-level]] - \\n[rst2man-indent0] \\n[rst2man-indent1] \\n[rst2man-indent2] .. .de1 INDENT .\" .rstReportMargin pre: . RS \\$1 . nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin] . nr rst2man-indent-level +1 .\" .rstReportMargin post: .. .de UNINDENT . RE .\" indent \\n[an-margin] .\" old: \\n[rst2man-indent\\n[rst2man-indent-level]] .nr rst2man-indent-level -1 .\" new: \\n[rst2man-indent\\n[rst2man-indent-level]] .in \\n[rst2man-indent\\n[rst2man-indent-level]]u .. [image: bcbio banner] [image] .sp A python toolkit providing best\-practice pipelines for fully automated high throughput sequencing analysis. You write a high level configuration file specifying your inputs and analysis parameters. This input drives a parallel pipeline that handles distributed execution, idempotent processing restarts and safe transactional steps. The goal is to provide a shared community resource that handles the data processing component of sequencing analysis, providing researchers with more time to focus on the downstream biology. .sp Contents: .sp bcbio\-nextgen provides best\-practice pipelines for automated analysis of high throughput sequencing data with the goal of being: .INDENT 0.0 .IP \(bu 2 Quantifiable: Doing good science requires being able to accurately assess the quality of results and re\-verify approaches as new algorithms and software become available. .IP \(bu 2 Analyzable: Results feed into tools to make it easy to query and visualize the results. .IP \(bu 2 Scalable: Handle large datasets and sample populations on distributed heterogeneous compute environments. .IP \(bu 2 Reproducible: Track configuration, versions, provenance and command lines to enable debugging, extension and reproducibility of results. .IP \(bu 2 Community developed: The development process is fully open and sustained by contributors from multiple institutions. By working together on a shared framework, we can overcome the challenges associated with maintaining complex pipelines in a rapidly changing area of research. .IP \(bu 2 Accessible: Bioinformaticians, biologists and the general public should be able to run these tools on inputs ranging from research materials to clinical samples to personal genomes. .UNINDENT .SH USERS .sp A sample of institutions using bcbio\-nextgen for solving biological problems. Please submit your story if you\(aqre using the pipeline in your own research. .INDENT 0.0 .IP \(bu 2 \fI\%Harvard School of Public Health\fP: We use bcbio\-nextgen within the bioinformatics core for variant calling on large population studies related to human health like Breast Cancer and Alzheimer\(aqs disease. Increasing scalability of the pipeline has been essential for handling study sizes of more than 1400 whole genomes. .UNINDENT .INDENT 0.0 .IP \(bu 2 \fI\%Massachusetts General Hospital\fP: The Department of Molecular Biology uses the pipeline to automatically process samples coming off Illumina HiSeq instruments. Automated pipelines perform alignment and sample\-specific analysis, with results directly uploaded into a local \fI\%Galaxy\fP instance. 
.UNINDENT .INDENT 0.0 .IP \(bu 2 \fI\%Science for Life Laboratory\fP: The genomics core platform in the Swedish National Infrastructure (\fI\%NGI\fP) for genomics has crunched over 16TBp (terabasepairs) and processed more than 7000 samples from the beginning of 2013 until the end of July. \fI\%UPPMAX\fP, our cluster located in Uppsala, has run the pipeline in production since 2010. .UNINDENT .INDENT 0.0 .IP \(bu 2 \fI\%Institute of Human Genetics, UCSF\fP: The Genomics Core Facility utilizes bcbio\-nextgen to process more than 2000 whole genome, exome, RNA\-seq and ChIP\-seq samples across various projects. This pipeline tremendously lowers the barrier to accessing next generation sequencing technology. The community engaged here is also very helpful in providing best\-practice advice and up\-to\-date solutions to ease scientific discovery. .UNINDENT .INDENT 0.0 .IP \(bu 2 \fI\%IRCCS "Mario Negri" Institute for Pharmacological Research\fP: The Translational Genomics Unit in the Department of Oncology uses bcbio\-nextgen for targeted resequencing (using an Illumina MiSeq) to identify mutations and other variants in tumor samples to investigate their link to tumor progression, patient survival and drug sensitivity and resistance. A \fI\%poster from the 2014 European Society of Human Genetics meeting\fP provides more details on usage in ovarian cancer. A paper on the study of longitudinal ovarian cancer biopsies, which makes extensive use of bcbio\-nextgen, \fI\%was published in 2015 in Annals of Oncology\fP\&. .UNINDENT .INDENT 0.0 .IP \(bu 2 \fI\%The Translational Genomics Research Institute (TGen)\fP: Members of the \fI\%Huentelman lab\fP at TGen apply bcbio\-nextgen to a wide variety of studies, with a major focus on the neurobiology of aging and neurodegeneration, in collaboration with the Arizona Alzheimer\(aqs Consortium (\fI\%AAC\fP) and the \fI\%McKnight Brain Research Foundation\fP\&. We also use bcbio in studies of rare diseases in children through TGen\(aqs Center for Rare Childhood Disorders (\fI\%C4RCD\fP), and other rare diseases such as Multiple System Atrophy (\fI\%MSA\fP). bcbio\-nextgen has also been instrumental in projects for TGen\(aqs Program for Canine Health & Performance (\fI\%PCHP\fP) and numerous RNA\-seq projects using rodent models. Our work with bcbio started with a partnership between \fIDell\fP, The Neuroblastoma and Medulloblastoma Translational Research Consortium (\fI\%NMTRC\fP) and TGen as part of a Phase I clinical trial in these rare childhood cancers. .UNINDENT .INDENT 0.0 .IP \(bu 2 \fI\%Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT\fP: The \fI\%Gifford lab\fP uses the bcbio\-nextgen pipeline to analyze a variety of sequencing datasets for their research in genetics and regulatory genomics (including the \fI\%SysCode\fP and \fI\%stemcell.mit.edu\fP projects). The pipeline applies collaboratively\-developed best practices for analysis as well as computation, which enables the lab to run the pipeline on local clusters and Amazon EC2. .UNINDENT .INDENT 0.0 .IP \(bu 2 \fI\%Sheffield Bioinformatics Core, The University of Sheffield\fP: The Sheffield Bioinformatics Core is a relatively new core facility at The University of Sheffield, and bcbio has been instrumental in setting up a best\-practice bioinformatics analysis service. We employ bcbio to automate the analysis of RNA\-seq, small RNA and ChIP\-seq datasets for researchers at The University of Sheffield and the NIHR Biomedical Research Centre. 
In conjunction with the bcbioRNASeq Bioconductor package, we deliver publication\-quality reports to our researchers based on reproducible analyses. .UNINDENT .SH AUTOMATED .sp We provide an automated script that installs third party analysis tools, required genome data and python library dependencies for running human variant and RNA\-seq analysis, bundled into an isolated directory or virtual environment: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C wget https://raw.github.com/bcbio/bcbio\-nextgen/master/scripts/bcbio_nextgen_install.py python bcbio_nextgen_install.py /usr/local/share/bcbio \-\-tooldir=/usr/local \e \-\-genomes GRCh37 \-\-aligners bwa \-\-aligners bowtie2 .ft P .fi .UNINDENT .UNINDENT .sp bcbio should install cleanly on Linux systems. For Mac OSX, we suggest trying \fI\%bcbio\-vm\fP, which runs bcbio on docs\-cloud or isolates all the third party tools inside a Docker container. bcbio\-vm is still a work in progress, but we suggest it because not all of the dependencies bcbio uses install cleanly on OSX. .sp With the command line above, indexes and associated data files go in \fB/usr/local/share/bcbio\fP and tools are in \fB/usr/local\fP\&. If you don\(aqt have write permissions to install into the \fB/usr/local\fP directories you can install in a user directory like \fB~/local\fP or use \fBsudo chmod\fP to give your standard user permissions. Please don\(aqt run the installer with sudo or as the root user. Do not use directories with \fB:\fP in the name; colons are not POSIX compliant in paths and will cause installation failures. .sp The installation is highly customizable, and you can install additional software and data later using \fBbcbio_nextgen.py upgrade\fP\&. Run \fBpython bcbio_nextgen_install.py\fP with no arguments to see options for configuring the installation process. Some useful arguments are: .INDENT 0.0 .IP \(bu 2 \fB\-\-isolate\fP Avoid updating the user\(aqs \fB~/.bashrc\fP if installing in a non\-standard PATH. This facilitates creation of isolated modules without disrupting the user\(aqs environment setup. Manually edit your \fI~/.bashrc\fP to allow bcbio runs with: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C export PATH=/path_to_bcbio/bin:$PATH .ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 \fB\-\-nodata\fP Do not install genome data. .UNINDENT .sp The machine will need to have some basic requirements for installing and running bcbio: .INDENT 0.0 .IP \(bu 2 Python 2.7, Python 3.x, or Python 2.6 plus the argparse dependency. .IP \(bu 2 Basic system setup for unpacking files: tar, gzip, unzip, bzip2, xz\-utils. .IP \(bu 2 The git version control system (\fI\%http://git\-scm.com/\fP) .IP \(bu 2 wget for file retrieval (\fI\%https://www.gnu.org/software/wget/\fP) .UNINDENT .sp Optional tool specific requirements: .INDENT 0.0 .IP \(bu 2 Java 1.7, needed when running GATK < 3.6 or MuTect. This must be available in your path so typing \fBjava \-version\fP resolves a 1.7 version. bcbio distributes Java 8 as part of the anaconda installation for recent versions of GATK and MuTect2. You can override the Java 8 installed with bcbio by setting \fBBCBIO_JAVA_HOME=/path/to/your/javadir\fP if you have the java you want in \fB/path/to/your/javadir/bin/java\fP, as shown in the example after this list. .IP \(bu 2 An OpenGL library, like \fI\%Mesa\fP (On Ubuntu/deb systems: \fBlibglu1\-mesa\fP, On RedHat/rpm systems: \fBmesa\-libGLU\-devel\fP). This is only required for cancer heterogeneity analysis with BubbleTree. .IP \(bu 2 The Pisces tumor\-only variant caller requires the \fI\%Microsoft .NET runtime\fP\&. 
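.UNINDENT .sp For example, to override the Java 8 distributed with bcbio using your own installation (a minimal sketch of the setting described above; the path is a placeholder for wherever your \fBbin/java\fP actually lives): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C export BCBIO_JAVA_HOME=/path/to/your/javadir .ft P .fi .UNINDENT .UNINDENT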
.sp The \fI\%bcbio\-nextgen Dockerfile\fP contains the packages needed to install on bare Ubuntu systems. .sp The automated installer creates a fully integrated environment that allows simultaneous updates of the framework, third party tools and biological data. This offers the advantage over manual installation of being able to manage and evolve a consistent analysis environment as algorithms continue to evolve and improve. Installing this way is as isolated and self\-contained as possible without virtual machines or lightweight system containers like \fI\%Docker\fP\&. The \fI\%Upgrade\fP section has additional documentation on including additional genome data for supported bcbio genomes. For genome builds not included in the defaults, see the documentation on config\-custom\&. Following installation, you should edit the pre\-created system configuration file in \fB/usr/local/share/bcbio/galaxy/bcbio_system.yaml\fP to match your local system or cluster configuration (see tuning\-cores). .SH UPGRADE .sp We use the same automated installation process for performing upgrades of tools, software and data in place. Since there are multiple targets and we want to avoid upgrading anything unexpectedly, we have specific arguments for each. Generally, you\(aqd want to upgrade the code, tools and data together with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py upgrade \-u stable \-\-tools \-\-data .ft P .fi .UNINDENT .UNINDENT .sp Tune the upgrade with these options: .INDENT 0.0 .IP \(bu 2 \fB\-u\fP Type of upgrade to do for bcbio\-nextgen code. \fBstable\fP gets the most recent released version and \fBdevelopment\fP retrieves the latest code from GitHub. .IP \(bu 2 \fB\-\-datatarget\fP Customize installed data or download additional files not included by default: \fI\%Customizing data installation\fP .IP \(bu 2 \fB\-\-toolplus\fP Specify additional tools to include. See the section on \fI\%Extra software\fP for more details. .IP \(bu 2 \fB\-\-genomes\fP and \fB\-\-aligners\fP options add additional aligner indexes to download and prepare. \fBbcbio_nextgen.py upgrade \-h\fP lists available genomes and aligners. If you want to install multiple genomes or aligners at once, specify \fB\-\-genomes\fP or \fB\-\-aligners\fP multiple times, like this: \fB\-\-genomes GRCh37 \-\-genomes mm10 \-\-aligners bwa \-\-aligners bowtie2\fP .IP \(bu 2 Leave out the \fB\-\-tools\fP option if you don\(aqt want to upgrade third party tools. If using \fB\-\-tools\fP, it will use the same directory as specified during installation. If you\(aqre using an older version that has not yet gone through a successful upgrade or installation and saved the tool directory, you should manually specify \fB\-\-tooldir\fP for the first upgrade. You can also pass \fB\-\-tooldir\fP to install to a different directory. .IP \(bu 2 Leave out the \fB\-\-data\fP option if you don\(aqt want to get any upgrades of associated genome data. .IP \(bu 2 Some aligners such as STAR don\(aqt have pre\-built indices due to their large file sizes. Set the number of cores to use for indexing with \fB\-\-cores 8\fP\&. .UNINDENT .SH CUSTOMIZING DATA INSTALLATION .sp bcbio installs associated data files for sequence processing, and you\(aqre able to customize this to install larger files or change the defaults. Use the \fB\-\-datatarget\fP flag (potentially multiple times) to customize or add new targets, as in the example below. 
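.sp For example, to add the GEMINI and CADD data files to an existing installation (a sketch combining the flags documented here; substitute the targets you need): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py upgrade \-\-datatarget gemini \-\-datatarget cadd .ft P .fi .UNINDENT .UNINDENT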
.sp By default, bcbio will install data files for \fBvariation\fP, \fBrnaseq\fP and \fBsmallrna\fP but you can sub\-select a single one of these if you don\(aqt require other analyses. The available targets are: .INDENT 0.0 .IP \(bu 2 \fBvariation\fP \-\- Data files required for variant calling: SNPs, indels and structural variants. These include files for annotation like dbSNP, associated files for variant filtering, coverage and annotation files. .IP \(bu 2 \fBrnaseq\fP \-\- Transcripts and indices for running RNA\-seq. The transcript files are also used for annotating and prioritizing structural variants. .IP \(bu 2 \fBsmallrna\fP \-\- Data files for doing small RNA analysis. .IP \(bu 2 \fBgemini\fP \-\- The \fI\%GEMINI\fP framework associates publicly available metadata with called variants, and provides utilities for query and analysis. This target installs the required GEMINI data files, including \fI\%ExAC\fP\&. .IP \(bu 2 \fBgnomad\fP \-\- \fI\%gnomAD\fP is a large scale collection of genome variants, expanding on ExAC to include whole genome and more exome inputs. This is a large 25Gb download, available for human genome builds GRCh37, hg19 and hg38. .IP \(bu 2 \fBcadd\fP \-\- \fI\%CADD\fP evaluates the potential impact of variations. It is freely available for non\-commercial research, but requires licensing for commercial usage. The download is 30Gb and GEMINI will include CADD annotations if present. .IP \(bu 2 \fBvep\fP \-\- Data files for the \fI\%Variant Effects Predictor (VEP)\fP\&. To use VEP as an alternative to the default installed snpEff, set \fBvep\fP in the variant\-config configuration. .IP \(bu 2 \fBdbnsfp\fP \-\- Like CADD, \fI\%dbNSFP\fP provides integrated and generalized metrics from multiple sources to help with prioritizing variations for follow up. The files are large: dbNSFP is 10Gb, expanding to 100Gb during preparation. VEP will use dbNSFP for annotation of VCFs if included. .IP \(bu 2 \fBdbscsnv\fP \-\- \fI\%dbscSNV\fP includes all potential human SNVs within splicing consensus regions (−3 to +8 at the 5’ splice site and −12 to +2 at the 3’ splice site), i.e. scSNVs, related functional annotations and two ensemble prediction scores for predicting their potential of altering splicing. VEP will use dbscSNV for annotation of VCFs if included. .IP \(bu 2 \fBbattenberg\fP \-\- Data files for \fI\%Battenberg\fP, which detects subclonality and copy number changes in whole genome cancer samples. .IP \(bu 2 \fBkraken\fP \-\- Database for \fI\%Kraken\fP, optionally used for contamination detection. .IP \(bu 2 \fBericscript\fP \-\- Database for \fI\%EricScript\fP gene fusion detection. Supports hg38, hg19 and GRCh37. .UNINDENT .sp For somatic analyses, bcbio includes \fI\%COSMIC\fP v68 for hg19 and GRCh37 only. Due to license restrictions, we cannot include updated versions of this dataset and hg38 support with the installer. To prepare these datasets yourself you can use \fI\%a utility script shipped with cloudbiolinux\fP that downloads, sorts and merges the VCFs, then copies them into your bcbio installation: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C export COSMIC_USER="your@registered.email.edu" export COSMIC_PASS="cosmic_password" bcbio_python prepare_cosmic.py 83 /path/to/bcbio .ft P .fi .UNINDENT .UNINDENT .SH EXTRA SOFTWARE .sp We\(aqre not able to automatically install some useful tools due to licensing restrictions, so we provide a mechanism to manually download and add these to bcbio\-nextgen during an upgrade with the \fB\-\-toolplus\fP command line. 
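.sp The general form is (a sketch; the tool name and path are placeholders, with concrete GATK and MuTect invocations shown in the next section): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py upgrade \-\-tools \-\-toolplus <name>=/path/to/downloaded/tool .ft P .fi .UNINDENT .UNINDENT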
.SS GATK and MuTect/MuTect2 .sp bcbio includes an installation of GATK4, which is freely available for all uses and is used by default for running HaplotypeCaller or MuTect2. If you want to use an older version of GATK, it requires manual installation. This is freely available for academic users, but requires a \fI\%license for commercial use\fP\&. It is not freely redistributable, so it requires a manual download from the \fI\%GATK download\fP site. You also need to include \fBtools_off: [gatk4]\fP in your configuration for runs: see config\-changing\-defaults\&. .sp To install GATK3, register your downloaded jar with the pre\-installed gatk bioconda wrapper: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C gatk\-register /path/to/GenomeAnalysisTK.tar.bz2 .ft P .fi .UNINDENT .UNINDENT .sp If you\(aqre not using the most recent post\-3.6 version of GATK, or are using a nightly build, you can add \fB\-\-noversioncheck\fP to the command line to skip comparisons to the GATK version. .sp \fI\%MuTect2\fP is distributed with GATK in versions 3.5 and later. .sp To install versions of GATK < 3.6, download and unzip the latest version from the GATK distribution. Then make this jar available to bcbio\-nextgen with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py upgrade \-\-tools \-\-toolplus gatk=/path/to/gatk/GenomeAnalysisTK.jar .ft P .fi .UNINDENT .UNINDENT .sp This will copy the jar and update your bcbio_system.yaml and manifest files to reflect the new version. .sp MuTect also has similar licensing terms and requires a license for commercial use. After \fI\%downloading the MuTect jar\fP, make it available to bcbio: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py upgrade \-\-tools \-\-toolplus mutect=/path/to/mutect/mutect\-1.1.7.jar .ft P .fi .UNINDENT .UNINDENT .sp Note that MuTect does not provide an easy way to query for the current version, so your input jar needs to include the version in the name. .SH SYSTEM REQUIREMENTS .sp bcbio\-nextgen provides a wrapper around external tools and data, so the actual tools used drive the system requirements. For small projects, it should install on workstations or laptops with a couple of Gb of memory, and then scale as needed on clusters or multicore machines. .sp Disk space requirements for the tools, including all system packages, are under 4Gb. Biological data requirements will depend on the genomes and aligner indices used, but a suggested install with GRCh37 and bowtie2/bwa indexes uses approximately 35Gb of storage during preparation and ~25Gb after: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ du \-shc genomes/Hsapiens/GRCh37/* 3.8G bowtie2 5.1G bwa 3.0G rnaseq\-2014\-05\-02 3.0G seq 340M snpeff 4.2G variation 4.4G vep 23.5G total .ft P .fi .UNINDENT .UNINDENT .SH TROUBLESHOOTING .SS Proxy or firewall problems .sp Some steps retrieve third party tools from GitHub, which can run into issues if you\(aqre behind a proxy or block git ports. To instruct git to use \fBhttps://\fP globally instead of \fBgit://\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ git config \-\-global url.https://github.com/.insteadOf git://github.com/ .ft P .fi .UNINDENT .UNINDENT .SS GATK or Java Errors .sp Most software tools used by bcbio require Java 1.8. bcbio distributes an OpenJDK Java build and uses it so you don\(aqt need to install anything. Older versions of GATK (< 3.6) and MuTect require a locally installed Java 1.7. 
If you have version incompatibilities, you\(aqll see errors like: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C Unsupported major.minor version 51.0 .ft P .fi .UNINDENT .UNINDENT .sp Fixing this requires either installing Java 1.7 for old GATK and MuTect, or making sure you\(aqre not pointing to an incorrect java (\fBunset JAVA_HOME\fP). You can also tweak the java used by bcbio, described in the \fI\%Automated\fP installation section. .SS ImportErrors .sp Import errors with tracebacks containing Python libraries outside of the bcbio distribution (\fB/path/to/bcbio/anaconda\fP) are often due to other conflicting Python installations. bcbio tries to isolate itself as much as possible but external libraries can get included during installation due to the PYTHONHOME or PYTHONPATH environmental variables or local site libraries. These commands will temporarily unset those to get bcbio installed, after which it should ignore them automatically: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ unset PYTHONHOME $ unset PYTHONPATH $ export PYTHONNOUSERSITE=1 .ft P .fi .UNINDENT .UNINDENT .sp Finally, having a \fB\&.pydistutils.cfg\fP file in your home directory can interfere with where the libraries get installed. If you have this file in your home directory, temporarily renaming it to something else may fix your installation issue. .SH MANUAL PROCESS .sp The manual process does not allow the in\-place updates and management of third party tools that the automated installer makes possible. It\(aqs a more error\-prone and labor intensive process. If you find you can\(aqt use the installer, we\(aqd love to hear why so we can make it more amenable to your system. If you\(aqd like to develop against a bcbio installation, see the documentation on setting up a code\-devel\-infrastructure\&. .SS Tool Requirements .sp The code drives a number of next\-generation sequencing analysis tools that you need to install on any machines involved in the processing. The \fI\%CloudBioLinux\fP toolkit provides automated scripts to help with installation for both software and associated data files: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C fab \-f cloudbiolinux/fabfile.py \-H localhost install_biolinux:flavor=ngs_pipeline_minimal .ft P .fi .UNINDENT .UNINDENT .sp You can also install them manually, adjusting locations in the \fBresources\fP section of your \fBbcbio_system.yaml\fP configuration file as needed. The CloudBioLinux infrastructure provides a full list of third party software installed with bcbio\-nextgen in \fBpackages\-conda.yaml\fP, which lists all third party tools installed through \fI\%Bioconda\fP\&. .SS Data requirements .sp In addition to existing bioinformatics software, the pipeline requires associated data files for reference genomes, including pre\-built indexes for aligners. The \fI\%CloudBioLinux\fP toolkit again provides an automated way to download and prepare these reference genomes: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C fab \-f data_fabfile.py \-H localhost \-c your_fabricrc.txt install_data_s3:your_biodata.yaml .ft P .fi .UNINDENT .UNINDENT .sp The \fI\%biodata.yaml\fP file contains information about what genomes to download. The \fI\%fabricrc.txt\fP describes where to install the genomes by adjusting the \fBdata_files\fP variable. 
This creates a tree structure that includes a set of Galaxy\-style location files to describe locations of indexes: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C ├── galaxy │\ \ ├── tool\-data │\ \ │\ \ ├── alignseq.loc │\ \ │\ \ ├── bowtie_indices.loc │\ \ │\ \ ├── bwa_index.loc │\ \ │\ \ ├── sam_fa_indices.loc │\ \ │\ \ └── twobit.loc │\ \ └── tool_data_table_conf.xml ├── genomes │\ \ ├── Hsapiens │\ \ │\ \ ├── GRCh37 │\ \ │\ \ └── hg19 │\ \ └── phiX174 │\ \ └── phix └── liftOver .ft P .fi .UNINDENT .UNINDENT .sp Individual genome directories contain indexes for aligners in sub\-directories prefixed by the aligner name. This structured scheme helps manage aligners that don\(aqt have native Galaxy \fI\&.loc\fP files. The automated installer downloads and sets this up for you: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C \(ga\-\- phix |\-\- bowtie | |\-\- phix.1.ebwt | |\-\- phix.2.ebwt | |\-\- phix.3.ebwt | |\-\- phix.4.ebwt | |\-\- phix.rev.1.ebwt | \(ga\-\- phix.rev.2.ebwt |\-\- bowtie2 | |\-\- phix.1.bt2 | |\-\- phix.2.bt2 | |\-\- phix.3.bt2 | |\-\- phix.4.bt2 | |\-\- phix.rev.1.bt2 | \(ga\-\- phix.rev.2.bt2 |\-\- bwa | |\-\- phix.fa.amb | |\-\- phix.fa.ann | |\-\- phix.fa.bwt | |\-\- phix.fa.pac | |\-\- phix.fa.rbwt | |\-\- phix.fa.rpac | |\-\- phix.fa.rsa | \(ga\-\- phix.fa.sa |\-\- novoalign | \(ga\-\- phix |\-\- seq | |\-\- phix.dict | |\-\- phix.fa | \(ga\-\- phix.fa.fai \(ga\-\- ucsc \(ga\-\- phix.2bit .ft P .fi .UNINDENT .UNINDENT .SH GERMLINE VARIANT CALLING .sp bcbio implements configurable SNP, indel and structural variant calling for germline populations. We include whole genome and exome evaluations against reference calls from the \fI\%Genome in a Bottle\fP consortium and \fI\%Illumina Platinum Genomes\fP project, enabling continuous assessment of new alignment and variant calling algorithms. We regularly report on these comparisons and continue to improve approaches as the community makes new tools available. Here is some of the research that contributes to the current implementation: .INDENT 0.0 .IP \(bu 2 An introduction to the \fI\%variant evaluation framework\fP\&. This includes a comparison of the \fI\%bwa mem\fP and \fI\%novoalign\fP aligners. We also compared the \fI\%FreeBayes\fP, \fI\%GATK HaplotypeCaller\fP and \fI\%GATK UnifiedGenotyper\fP variant callers. .IP \(bu 2 An in\-depth evaluation of \fI\%FreeBayes and BAM post\-alignment processing\fP\&. We found that FreeBayes quality was equal to that of GATK HaplotypeCaller. Additionally, a lightweight post\-alignment preparation method using only de\-duplication was equivalent to GATK\(aqs recommended Base Quality Score Recalibration (BQSR) and realignment around indels, when using good quality input datasets and callers that do local realignment. .IP \(bu 2 Additional work to \fI\%improve variant filtering\fP, providing methods to remove low complexity regions (LCRs) that can bias indel results. We also tuned \fI\%GATK\(aqs Variant Quality Score Recalibrator\fP (VQSR) and compared it with cutoff\-based soft filtering. VQSR requires a large number of variants and we use it in bcbio with GATK HaplotypeCaller when your algorithm\-config contains high depth samples (\fBcoverage_depth\fP is not low) and you are calling on the whole genome (\fBcoverage_interval\fP is genome) or have more than 50 regional or exome samples called concurrently. .IP \(bu 2 An \fI\%evaluation of joint calling\fP with GATK HaplotypeCaller, FreeBayes, Platypus and samtools. 
This validates the joint calling implementation, allowing scaling of large population germline experiments. It also demonstrates improved performance of new callers: samtools 1.0 and Platypus. .IP \(bu 2 Support for \fI\%build 38 of the human genome\fP, improving precision of detection thanks to the improved genome representation. .UNINDENT .sp bcbio automates post\-variant calling annotation to make the outputs easier to feed directly into your biological analysis. We annotate variant effects using \fI\%snpEff\fP or \fI\%Variant Effect Predictor\fP (VEP), and prepare a \fI\%GEMINI database\fP that associates variants with multiple external annotations in a SQL\-based query interface. GEMINI databases have the most associated external information for human samples (GRCh37/hg19 and hg38) but are available for any organism, with the database populated using the VCF INFO column and predicted effects. .SS Basic germline calling .sp The best approach to build a bcbio docs\-config for germline calling is to use the automated\-sample\-config with one of the default templates: .INDENT 0.0 .IP \(bu 2 \fI\%FreeBayes template\fP \-\- Call variants using FreeBayes with a minimal preparation pipeline. This is a freely available unrestricted pipeline fully included in the bcbio installation. .IP \(bu 2 \fI\%GATK HaplotypeCaller template\fP \-\- Run GATK best practices, including Base Quality Score Recalibration, realignment and HaplotypeCaller variant calling. This requires a license from Broad for commercial use. You need to manually install GATK along with bcbio using downloads from the GATK Broad site or Appistry (see toolplus\-install). .UNINDENT .sp You may also want to enable \fI\%Structural variant calling\fP for detection of larger events, which works with either caller. Another good source of inspiration is the configuration files from the example\-pipelines, which may help identify other configuration variables of interest. A more complex setup with multiple callers and resolution of ensemble calls is generally only useful with a small population where you are especially concerned about sensitivity. Single caller detection with FreeBayes or GATK HaplotypeCaller provides good resolution of events. .SS Population calling .sp When calling multiple samples, we recommend calling together to provide improved sensitivity and a fully squared off final callset. To associate samples together in a population, add a \fBmetadata\fP \fBbatch\fP to the sample\-configuration: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C \- description: Sample1 metadata: batch: Batch1 \- description: Sample2 metadata: batch: Batch1 .ft P .fi .UNINDENT .UNINDENT .sp Batching samples results in output VCFs and GEMINI databases containing all merged sample calls. bcbio has two methods to call samples together: .INDENT 0.0 .IP \(bu 2 Batch or pooled calling \-\- This calls all samples simultaneously by feeding them to the variant caller. This works for smaller batch sizes (< 100 samples) as memory requirements become limiting in larger pools. This is the default approach taken when you specify a \fBvariantcaller\fP in the variant\-config configuration. .IP \(bu 2 Joint calling \-\- This calls samples independently, then combines them together into a single callset by integrating the individual calls. This scales to larger population sizes by avoiding the computational bottlenecks of pooled calling. 
We recommend joint calling with HaplotypeCaller if you have a license for GATK usage, but also support joint calling with FreeBayes using a custom implementation. Specifying a \fBjointcaller\fP along with the appropriate \fBvariantcaller\fP in the variant\-config configuration enables this: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C \- description: Sample1 algorithm: variantcaller: gatk\-haplotype jointcaller: gatk\-haplotype\-joint metadata: batch: Batch1 \- description: Sample2 algorithm: variantcaller: gatk\-haplotype jointcaller: gatk\-haplotype\-joint metadata: batch: Batch1 .ft P .fi .UNINDENT .UNINDENT .UNINDENT .SH CANCER VARIANT CALLING .sp bcbio supports somatic cancer calling with tumor and optionally matched normal pairs using multiple SNP, indel and structural variant callers. A \fI\%full evaluation of cancer calling\fP validates callers against \fI\%synthetic dataset 3 from the ICGC\-TCGA DREAM challenge\fP\&. bcbio uses a majority voting ensemble approach to combine calls from multiple SNP and indel callers, and also flattens structural variant calls into a combined representation. .sp The \fI\%example configuration\fP for the example\-cancer validation is a good starting point for setting up a tumor/normal run on your own dataset. The configuration works similarly to population based calling. Supply a consistent batch for tumor/normal pairs and mark them with the phenotype: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C \- description: your\-tumor algorithm: variantcaller: [vardict, strelka2, mutect2] metadata: batch: batch1 phenotype: tumor \- description: your\-normal algorithm: variantcaller: [vardict, strelka2, mutect2] metadata: batch: batch1 phenotype: normal .ft P .fi .UNINDENT .UNINDENT .sp Other config\-cancer configuration options allow tweaking of the processing parameters. For pairs you want to analyze together, specify a consistent set of \fBvariantcaller\fP options for both samples. .sp Cancer calling handles both tumor\-normal paired calls and tumor\-only calling. To specify a tumor\-only sample, provide a single sample labeled with \fBphenotype: tumor\fP\&. Otherwise the configuration and setup is the same as with paired analyses. For tumor\-only samples, bcbio will try to remove likely germline variants that are present in public databases like 1000 genomes and ExAC but not in COSMIC. This runs as long as you have a local GEMINI data installation (\fB\-\-datatarget gemini\fP) and marks likely germline variants with a \fBLowPriority\fP filter. \fI\%This post has more details\fP on the approach and validation. .sp The standard variant outputs (\fBsample\-caller.vcf.gz\fP) for tumor calling emphasize somatic differences: those variants likely unique to the cancer. If you have a tumor\-only sample and GEMINI data installed, it will also output \fBsample\-caller\-germline.vcf.gz\fP, which tries to identify germline background mutations based on presence in public databases. If you have tumor/normal data and would like to also call likely germline mutations, see the documentation on specifying a germline caller: \fI\%Somatic with germline variants\fP\&. .sp We\(aqre actively working on improving calling to better account for the heterogeneity and structural variability that define cancer genomes. .SH SOMATIC WITH GERMLINE VARIANTS .sp For tumor/normal somatic samples, bcbio can call both somatic (tumor\-specific) and germline (pre\-existing) variants. 
The typical outputs of \fI\%Cancer variant calling\fP are likely somatic variants acquired by the cancer, but pre\-existing germline risk variants are often also diagnostic. .sp For tumor\-only cases we suggest running standard \fI\%Cancer variant calling\fP\&. Tumor\-only inputs mix somatic and germline variants, making it difficult to separate events. For small variants (SNPs and indels) bcbio will attempt to distinguish somatic and germline mutations using the presence of variants in population databases. .sp To obtain somatic and germline calls for your tumor/normal inputs, specify which callers to use for each step in the variant\-config configuration: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C description: your\-normal variantcaller: somatic: vardict germline: freebayes .ft P .fi .UNINDENT .UNINDENT .sp bcbio does a single alignment for the normal sample, then splits at the variant calling steps, using this normal sample to do germline calling. In this example, the output files are: .INDENT 0.0 .IP \(bu 2 \fByour\-tumor/your\-tumor\-vardict.vcf.gz\fP \-\- Somatic calls from the tumor samples using the normal as background to subtract existing calls. .IP \(bu 2 \fByour\-normal/your\-normal\-freebayes.vcf.gz\fP \-\- Germline calls on the normal sample. .UNINDENT .sp Germline calling supports multiple callers, and other configuration options like ensemble and structural variant calling inherit from the remainder of the configuration. For example, to use 3 callers for somatic and germline calling, create ensemble calls for both and include germline and somatic events from two structural variant callers: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C variantcaller: somatic: [vardict, strelka2, mutect2] germline: [freebayes, gatk\-haplotype, strelka2] ensemble: numpass: 2 svcaller: [manta, cnvkit] .ft P .fi .UNINDENT .UNINDENT .sp In addition to the somatic and germline outputs attached to the tumor and normal sample outputs as described above, you\(aqll get: .INDENT 0.0 .IP \(bu 2 \fByour\-tumor/your\-tumor\-manta.vcf.gz\fP \-\- Somatic structural variant calls for each specified \fBsvcaller\fP\&. These will have genotypes for both the tumor and normal samples, with somatic calls labeled as PASS variants. .IP \(bu 2 \fByour\-normal/your\-normal\-manta.vcf.gz\fP \-\- Germline structural variant calls for each specified \fBsvcaller\fP\&. We expect these to be noisier than the somatic calls due to the lack of a reference sample to help remove technical noise. .UNINDENT .SH STRUCTURAL VARIANT CALLING .sp bcbio can detect larger structural variants like deletions, insertions, inversions and copy number changes for both germline population and cancer variant calling, based on validation against existing truth sets: .INDENT 0.0 .IP \(bu 2 \fI\%Validation of germline structural variant detection\fP using multiple calling methods to validate against deletions in NA12878. This implements a pipeline that works in tandem with SNP and indel calling to detect larger structural variations like deletions, duplications, inversions and copy number variants (CNVs). .IP \(bu 2 \fI\%Validation of tumor/normal calling\fP using the synthetic DREAM validation set. This includes validation of additional callers against duplications, insertions and inversions. 
.UNINDENT .sp To enable structural variant calling, specify \fBsvcaller\fP options in the algorithm section of your configuration: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C \- description: Sample algorithm: svcaller: [lumpy, manta, cnvkit] .ft P .fi .UNINDENT .UNINDENT .sp The best supported callers are \fI\%Lumpy\fP and \fI\%Manta\fP, for paired end and split read calling, \fI\%CNVkit\fP for read\-depth based CNV calling, and \fI\%WHAM\fP for association testing. We also support \fI\%DELLY\fP, another excellent paired end and split read caller, although it is slow on large whole genome datasets. .sp In addition to results from individual callers, bcbio can create a summarized ensemble callset using \fI\%MetaSV\fP\&. We\(aqre actively working on improved structural variant reporting to highlight potential variants of interest. .SH RNA-SEQ .sp bcbio can also be used to analyze RNA\-seq data. It includes steps for quality control, adapter trimming, alignment, variant calling, transcriptome reconstruction and post\-alignment quantitation at the level of the gene and isoform. .sp We recommend using the STAR aligner for all genomes where there are no alt alleles. For genomes such as hg38 that have alt alleles, hisat2 should be used as it handles the alts correctly and STAR does not yet. Use Tophat2 only if you do not have enough RAM available to run STAR (about 30 GB). .sp Our current recommendation is to run adapter trimming only if using the Tophat2 aligner. Adapter trimming is very slow, and aligners that soft clip the ends of reads, such as STAR and hisat2, or algorithms using pseudoalignments, like Sailfish, handle contaminant sequences at the ends properly, making trimming unnecessary. Tophat2 does not perform soft clipping, so if using Tophat2, trimming must still be done. .sp Salmon, an extremely fast alignment\-free method of quantitation, is run for all experiments. Salmon can accurately quantitate the expression of genes, even ones which are hard to quantitate with other methods (see \fI\%this paper\fP, for example, for Sailfish, which performs similarly to Salmon). Salmon can also quantitate at the transcript level, which can help gene\-level analyses (see \fI\%this paper\fP for example). We recommend using the Salmon quantitation rather than the counts from featureCounts for downstream analysis. .sp Although we do not recommend using the featureCounts based counts, the alignments are still useful because they give you many more quality metrics than the quasi\-alignments from Salmon. .sp After a bcbio RNA\-seq run, the \fBupload\fP directory will contain a directory for each sample with a BAM file of the aligned and unaligned reads, a \fBsailfish\fP directory with the output of Salmon, including TPM values, and a \fBqc\fP directory with plots from FastQC and qualimap. .sp In addition to directories for each sample, the \fBupload\fP directory contains a project directory with a YAML file describing some summary statistics for each sample and some provenance data about the bcbio run. That directory also holds a \fBcombined.counts\fP file with the featureCounts derived counts per gene. .SH FAST RNA-SEQ .sp This mode of \fBbcbio\-nextgen\fP quantitates transcript expression using \fI\%Salmon\fP and does nothing else. It is an order of magnitude faster or more than running the full RNA\-seq analysis. 
The cost of the increased speed is that you will have much less information about your samples at the end of the run, which can make troubleshooting trickier. Invoke with \fBanalysis: fastrna\-seq\fP\&. .SH SINGLE-CELL RNA-SEQ .sp bcbio\-nextgen supports unique molecular identifier (UMI) based single\-cell RNA\-seq analyses. If your single\-cell prep does not use unique molecular identifiers (UMI), you can most likely just run the standard RNA\-seq pipeline and use the results from that. The UMIs are used to discard reads which are possibly PCR duplicates, which is very helpful for removing some of the PCR duplicate noise that can dominate single\-cell experiments. .sp Unlike the standard RNA\-seq pipeline, the single\-cell pipeline expects the FASTQ input files to not be separated by cellular barcode, so each file is a mix of cells identified by a cellular barcode (CB), with unique reads from a transcript identified by a UMI. bcbio\-nextgen inspects each read, identifies the cellular barcode and UMI and puts them in the read name. Then the reads are aligned to the transcriptome with \fI\%RapMap\fP and the number of reads aligning to each transcript is counted for each cellular barcode. The output is a table of counts with transcripts as the rows and the cellular barcodes for each input FASTQ file as the columns. .sp Optionally the reads can be quantitated with \fBkallisto\fP to output transcript compatibility counts rather than counts per gene (\fI\%TCC paper\fP). .sp To extract the UMI and cellular barcodes from the read, bcbio\-nextgen needs to know where the UMI and the cellular barcode are expected to be in the read. Currently there is support for two schemes, the inDrop system from the Harvard single\-cell core facility and CEL\-seq. If bcbio\-nextgen does not support your UMI and barcoding scheme, please open up an issue and we will help implement support for it. .sp Most of the heavy lifting for this part of \fIbcbio\-nextgen\fP is implemented in the \fI\%umis\fP repository. .SH SMALLRNA-SEQ .sp bcbio\-nextgen also implements a configurable best\-practices pipeline for smallRNA\-seq quality control, adapter trimming, miRNA/isomiR quantification and other small RNA detection. .INDENT 0.0 .IP \(bu 2 Adapter trimming: .INDENT 2.0 .IP \(bu 2 \fI\%atropos\fP .IP \(bu 2 \fI\%dnapi\fP for adapter de\-novo detection .UNINDENT .IP \(bu 2 Sequence alignment: .INDENT 2.0 .IP \(bu 2 \fI\%STAR\fP for genome annotation .IP \(bu 2 bowtie, \fIbowtie2\fP and \fI\%hisat2\fP as alternative options for genome annotation .UNINDENT .IP \(bu 2 Specific small RNA quantification (miRNA/tRNAs...): .INDENT 2.0 .IP \(bu 2 \fI\%seqbuster\fP for miRNA annotation .IP \(bu 2 \fI\%MINTmap\fP for tRNA fragments annotation .IP \(bu 2 .INDENT 2.0 .TP .B \fI\%miRge2\fP for alternative small RNA quantification. To set up this tool, you need to manually install miRge2.0 and download the library data for your species. Read how to install and download the data \fI\%here\fP\&. If you have the \fBhuman\fP folder at \fB/mnt/data/human\fP, the option to pass to resources will be \fB/mnt/data\fP\&. 
Then set up \fBresources\fP in your sample configuration: .INDENT 3.5 .sp .nf .ft C resources: mirge: options: ["\-lib $PATH_TO_PARENT_SPECIES_LIB"] .ft P .fi .UNINDENT .UNINDENT .UNINDENT .IP \(bu 2 Quality control: .INDENT 2.0 .IP \(bu 2 \fI\%FastQC\fP .UNINDENT .IP \(bu 2 Other small RNA quantification: .INDENT 2.0 .IP \(bu 2 \fI\%seqcluster\fP .IP \(bu 2 \fI\%mirDeep2\fP for miRNA prediction .UNINDENT .UNINDENT .sp The pipeline generates an RMD template file inside the \fBreport\fP folder that can be rendered with knitr. An example report is available \fI\%here\fP\&. The count table (\fBcounts_mirna.tsv\fP) from mirbase miRNAs will be inside the \fBmirbase\fP or final project folder. Input files for the \fI\%isomiRs\fP package for isomiR analysis will be inside each sample\(aqs \fBmirbase\fP folder. If mirdeep2 can run, the count table (\fBcounts_mirna_novel.tsv\fP) for novel miRNAs will be inside the \fBmirdeep2\fP or final project folder. tdrmapper results will be inside each sample\(aqs \fBtdrmapper\fP folder or the final project folder. .SH CHIP-SEQ .sp The bcbio\-nextgen implementation of ChIP\-seq aligns, removes multimapping reads, calls peaks with a paired input file using MACS2 and outputs a set of greylist regions for filtering possible false peaks in regions of high depth in the input file. .INDENT 0.0 .IP \(bu 2 Adapter trimming: \fI\%atropos\fP .IP \(bu 2 Sequence alignment: \fI\%bowtie2\fP, \fI\%bwa mem\fP .IP \(bu 2 Peak calling: \fI\%macs2\fP .IP \(bu 2 Greylisting: \fI\%chipseq\-greylist\fP .IP \(bu 2 Quality control: \fI\%FastQC\fP .UNINDENT .SH STANDARD .sp This pipeline implements \fBalignment\fP and \fBqc\fP tools. Furthermore, it will run \fI\%qsignature\fP to detect possible duplicated or mislabeled samples. It uses a SNP signature to create a distance matrix that makes it easy to identify sample groups. The project yaml file will show the total number of samples analyzed, the number of very similar samples, and samples that could be duplicated. .SS Configuration .sp We will assume that you installed bcbio\-nextgen with the automated installer, and so your default \fI\%bcbio_system.yaml\fP file is configured correctly with all of the tools pointing to the right places. If that is the case, to run bcbio\-nextgen on a set of samples you just need to set up a YAML file that describes your samples and what you would like to do to them. Let\(aqs say that you have a single paired\-end control lane from a human sample, prepared with the Illumina \fI\%TruSeq\fP kit. Here is what a well\-formed sample YAML file for that RNA\-seq experiment would look like: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C fc_date: \(aq070113\(aq fc_name: control_experiment upload: dir: final details: \- files: [/full/path/to/control_1.fastq, /full/path/to/control_2.fastq] description: \(aqControl_rep1\(aq genome_build: GRCh37 analysis: RNA\-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [truseq, polya] strandedness: unstranded .ft P .fi .UNINDENT .UNINDENT .sp \fBfc_date\fP and \fBfc_name\fP will be combined to form a prefix to name intermediate files, and can be set to whatever you like. \fBupload\fP is explained pretty well in the \fI\%configuration documentation\fP and the above will direct bcbio\-nextgen to put the output files from the pipeline into the \fBfinal\fP directory. Under \fBdetails\fP is a list of sections each describing a sample to process. You can set many \fI\%parameters\fP under each section but most of the time just setting a few like the above is all that is necessary. 
\fBanalysis\fP tells bcbio\-nextgen to run the best\-practice RNA\-seq pipeline on this sample. .sp In the above example, since there are two files, \fBcontrol_1.fastq\fP and \fBcontrol_2.fastq\fP, the sample will automatically be run as paired\-end data. If you have single\-end data you can just supply one file and it will run as single\-end. The \fBdescription\fP field will be used to eventually rename the files, so make it very evocative since you will be looking at it a lot later. \fBgenome_build\fP is self\-explanatory. .sp Sometimes you need a little bit more flexibility than the standard pipeline, and the \fBalgorithm\fP section has many options to fine\-tune its behavior. \fBquality_format\fP tells bcbio\-nextgen what quality format your FASTQ inputs are using; if your samples were sequenced any time after 2009 or so, you probably want to set it to \fBStandard\fP\&. Adapter read\-through is a problem in RNA\-seq libraries, so we want to trim off possible adapter sequences on the ends of reads. Setting \fBtrim_reads\fP to \fBread_through\fP does this, and will also trim off poor quality ends. Since your library is an RNA\-seq library prepared with the TruSeq kit, the set of adapters to trim off are the TruSeq adapters and possible polyA tails, so \fBadapters\fP is set to both of those. If your library was prepared in a strand\-specific manner, set \fBstrandedness\fP to firststrand or secondstrand; otherwise leave it as unstranded (the default). .SS Multiple samples .sp Let\(aqs say you have a set of mouse samples to analyze and each sample is a single lane of single\-end RNA\-seq reads prepared using the Nextera kit. There are two case and two control samples. Here is a sample configuration file for that analysis: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C fc_date: \(aq070113\(aq fc_name: mouse_analysis upload: dir: final details: \- files: [/full/path/to/control_rep1.fastq] description: \(aqControl_rep1\(aq genome_build: mm10 analysis: RNA\-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [nextera, polya] \- files: [/full/path/to/control_rep2.fastq] description: \(aqControl_rep2\(aq genome_build: mm10 analysis: RNA\-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [nextera, polya] \- files: [/full/path/to/case_rep1.fastq] description: \(aqCase_rep1\(aq genome_build: mm10 analysis: RNA\-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [nextera, polya] \- files: [/full/path/to/case_rep2.fastq] description: \(aqCase_rep2\(aq genome_build: mm10 analysis: RNA\-seq algorithm: aligner: tophat2 quality_format: Standard trim_reads: read_through adapters: [nextera, polya] .ft P .fi .UNINDENT .UNINDENT .sp More samples are added just by adding more entries under the details section. This is tedious and error\-prone to do by hand, so there is an automated \fI\%template\fP system for common experiments. You could set up the previous experiment by making a mouse version of the \fI\%illumina\-rnaseq\fP template file and saving it to a local file such as \fBillumina\-mouse\-rnaseq.yaml\fP\&. 
Then you can set up the sample file using the templating system: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py \-w template illumina\-mouse\-rnaseq.yaml mouse_analysis /full/path/to/control_rep1.fastq /full/path/to/control_rep2.fastq /full/path/to/case_rep1.fastq /full/path/to/case_rep2.fastq .ft P .fi .UNINDENT .UNINDENT .sp If you have paired\-end samples instead of single\-end samples, you can still use the template system as long as the forward and reverse read filenames are the same, except for a _1 and _2. For example: control_1.fastq and control_2.fastq will be detected as paired and combined in the YAML file output by the templating system. .SH OVERVIEW .INDENT 0.0 .IP 1. 3 Create a \fI\%sample configuration file\fP for your project (substitute the example BAM and fastq names below with the full path to your sample files): .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py \-w template gatk\-variant project1 sample1.bam sample2_1.fq sample2_2.fq .ft P .fi .UNINDENT .UNINDENT .sp This uses a standard template (GATK best practice variant calling) to automate creation of a full configuration for all samples. See automated\-sample\-config for more details on running the script, and manually edit the base template or final output file to incorporate project specific configuration. The example pipelines provide a good starting point and the sample\-configuration documentation has full details on available options. .IP 2. 3 Run the analysis, distributed across 8 local cores: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py bcbio_sample.yaml \-n 8 .ft P .fi .UNINDENT .UNINDENT .IP 3. 3 Read the docs\-config documentation for full details on adjusting both the sample and system configuration files to match your experiment and computational setup. .UNINDENT .SH PROJECT DIRECTORY .sp bcbio encourages a project structure like: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C my\-project/ ├── config ├── final └── work .ft P .fi .UNINDENT .UNINDENT .sp with the input configuration in the \fBconfig\fP directory, the outputs of the pipeline in the \fBfinal\fP directory, and the actual processing done in the \fBwork\fP directory. Run the \fBbcbio_nextgen.py\fP script from inside the \fBwork\fP directory to keep all intermediates there. The \fBfinal\fP directory, relative to the parent directory of the \fBwork\fP directory, is the default location specified in the example configuration files and gets created during processing. The \fBfinal\fP directory has all of the finished outputs and you can remove the \fBwork\fP intermediates to clean up disk space after confirming the results. All of these locations are configurable and this project structure is only a recommendation. .SH LOGGING .sp There are three log files in the \fBlog\fP directory within your working folder: .INDENT 0.0 .IP \(bu 2 \fBbcbio\-nextgen.log\fP High level logging information about the analysis. This provides an overview of major processing steps and useful checkpoints for assessing run times. .IP \(bu 2 \fBbcbio\-nextgen\-debug.log\fP Detailed information about processes including stdout/stderr from third party software and error traces for failures. Look here to identify the status of running pipelines or to debug errors. It labels each line with the hostname of the machine it ran on to ease debugging in distributed cluster environments. .IP \(bu 2 \fBbcbio\-nextgen\-commands.log\fP Full command lines for all third party software tools run. 
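.UNINDENT .sp For example, to follow the detailed progress of a running analysis from the \fBwork\fP directory (a minimal sketch using standard Unix tools and the log names above): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C tail \-f log/bcbio\-nextgen\-debug.log .ft P .fi .UNINDENT .UNINDENT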
.SH EXAMPLE PIPELINES .sp We supply example input configuration files for validation and to help in understanding the pipeline. .SS Whole genome trio (50x) .sp This input configuration runs whole genome variant calling using bwa, GATK HaplotypeCaller and FreeBayes. It uses a father/mother/child trio from the \fI\%CEPH NA12878 family\fP: NA12891, NA12892, NA12878. Illumina\(aqs \fI\%Platinum genomes project\fP has 50X whole genome sequencing of the three members. The analysis compares results against a reference NA12878 callset from NIST\(aqs \fI\%Genome in a Bottle\fP initiative. .sp To run the analysis do: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C mkdir \-p NA12878\-trio\-eval/config NA12878\-trio\-eval/input NA12878\-trio\-eval/work cd NA12878\-trio\-eval/config wget https://raw.github.com/bcbio/bcbio\-nextgen/master/config/examples/NA12878\-trio\-wgs\-validate.yaml cd ../input wget https://raw.github.com/bcbio/bcbio\-nextgen/master/config/examples/NA12878\-trio\-wgs\-validate\-getdata.sh bash NA12878\-trio\-wgs\-validate\-getdata.sh cd ../work bcbio_nextgen.py ../config/NA12878\-trio\-wgs\-validate.yaml \-n 16 .ft P .fi .UNINDENT .UNINDENT .sp This is a large whole genome analysis and is meant to test both pipeline scaling and validation across the entire genome. It can take multiple days to run depending on available cores. It requires 300Gb for the input files and 1.3Tb for the work directory. Smaller examples below exercise the pipeline with less disk and computational requirements. .sp We also have a more extensive evaluation that includes two additional variant callers, Platypus and samtools, and three different methods of calling variants: single sample, pooled, and incremental joint calling. This uses the same input data as above but a different input configuration file: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C mkdir \-p NA12878\-trio\-eval/work_joint cd NA12878\-trio\-eval/config wget https://raw.github.com/bcbio/bcbio\-nextgen/master/config/examples/NA12878\-trio\-wgs\-joint.yaml cd ../work_joint bcbio_nextgen.py ../config/NA12878\-trio\-wgs\-joint.yaml \-n 16 .ft P .fi .UNINDENT .UNINDENT .SS Exome with validation against reference materials .sp This example calls variants on NA12878 exomes from \fI\%EdgeBio\(aqs\fP clinical sequencing pipeline, and compares them against reference materials from NIST\(aqs \fI\%Genome in a Bottle\fP initiative. This supplies a full regression pipeline to ensure consistency of calling between releases and updates of third party software. The pipeline performs alignment with bwa mem and variant calling with FreeBayes, GATK UnifiedGenotyper and GATK HaplotypeCaller. Finally it integrates all 3 variant calling approaches into a \fI\%combined ensemble callset\fP\&. .sp This is a large full exome example with multiple variant callers, so it can take more than 24 hours on machines using multiple cores. 
.sp First get the input configuration file, fastq reads, reference materials and analysis regions: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C mkdir \-p NA12878\-exome\-eval cd NA12878\-exome\-eval wget https://raw.github.com/bcbio/bcbio\-nextgen/master/config/examples/NA12878\-exome\-methodcmp\-getdata.sh bash NA12878\-exome\-methodcmp\-getdata.sh .ft P .fi .UNINDENT .UNINDENT .sp Finally run the analysis, distributed on 8 local cores, with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C cd work bcbio_nextgen.py ../config/NA12878\-exome\-methodcmp.yaml \-n 8 .ft P .fi .UNINDENT .UNINDENT .sp The \fBgrading\-summary.csv\fP file contains detailed comparisons of the results to the NIST reference materials, enabling rapid comparisons of methods. .SS Cancer tumor normal .sp This example calls variants using multiple approaches in a paired tumor/normal cancer sample from the \fI\%ICGC\-TCGA DREAM challenge\fP\&. It uses \fI\%synthetic dataset 3\fP which has multiple subclones, enabling detection of lower frequency variants. Since the dataset is freely available and has a truth set, this allows us to do a full evaluation of variant callers. .sp To get the data: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C mkdir \-p cancer\-dream\-syn3/config cancer\-dream\-syn3/input cancer\-dream\-syn3/work cd cancer\-dream\-syn3/config wget https://raw.githubusercontent.com/bcbio/bcbio\-nextgen/master/config/examples/cancer\-dream\-syn3.yaml cd ../input wget https://raw.githubusercontent.com/bcbio/bcbio\-nextgen/master/config/examples/cancer\-dream\-syn3\-getdata.sh bash cancer\-dream\-syn3\-getdata.sh .ft P .fi .UNINDENT .UNINDENT .sp Run with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C cd ../work bcbio_nextgen.py ../config/cancer\-dream\-syn3.yaml \-n 8 .ft P .fi .UNINDENT .UNINDENT .sp The configuration and data retrieval script include downloads for both exome\-only and whole genome analyses. Exome analysis is enabled by default; you can run the larger whole genome evaluation by uncommenting the relevant parts of the configuration and retrieval script. .SS Cancer\-like mixture with Genome in a Bottle samples .sp This example simulates somatic cancer calling using a mixture of two Genome in a Bottle samples, NA12878 as the "tumor" mixed with NA24385 as the background. The \fI\%Hartwig Medical Foundation\fP and \fI\%Utrecht Medical Center\fP generated this "tumor/normal" pair by physical mixing of samples prior to sequencing. The GiaB FTP directory has \fI\%more details on the design and truth sets\fP\&. The sample has variants at 15% and 30%, providing the ability to look at lower frequency mutations. .sp To get the data: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C wget https://raw.githubusercontent.com/bcbio/bcbio\-nextgen/master/config/examples/cancer\-giab\-na12878\-na24385\-getdata.sh bash cancer\-giab\-na12878\-na24385\-getdata.sh .ft P .fi .UNINDENT .UNINDENT .sp Then run the analysis with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C cd work bcbio_nextgen.py ../config/cancer\-giab\-na12878\-na24385.yaml \-n 16 .ft P .fi .UNINDENT .UNINDENT .SS Structural variant calling \-\- whole genome NA12878 (50x) .sp This example runs structural variant calling with multiple callers (Lumpy, Manta and CNVkit), providing a combined output summary file and validation metrics against NA12878 deletions. It uses the same NA12878 input as the whole genome trio example.
.sp To run the analysis do: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C mkdir \-p NA12878\-sv\-eval cd NA12878\-sv\-eval wget https://raw.github.com/bcbio/bcbio\-nextgen/master/config/examples/NA12878\-sv\-getdata.sh bash NA12878\-sv\-getdata.sh cd work bcbio_nextgen.py ../config/NA12878\-sv.yaml \-n 16 .ft P .fi .UNINDENT .UNINDENT .sp This is a large whole genome analysis, and the timing and disk space requirements for the NA12878 trio analysis above apply here as well. .SS RNAseq example .sp This example aligns and creates count files for use with downstream analyses using a subset of the SEQC data from the FDA\(aqs Sequencing Quality Control project. .sp Get the setup script and run it; this will download six samples from the SEQC project: three from the HBRR panel and three from the UHRR panel. This will require about 100GB of disk space for these input files. It will also set up a configuration file for the run, using the templating system: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C wget https://raw.github.com/bcbio/bcbio\-nextgen/master/config/examples/rnaseq\-seqc\-getdata.sh bash rnaseq\-seqc\-getdata.sh .ft P .fi .UNINDENT .UNINDENT .sp Now go into the work directory and run the analysis: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C cd seqc/work bcbio_nextgen.py ../config/seqc.yaml \-n 8 .ft P .fi .UNINDENT .UNINDENT .sp This will run a full\-scale RNAseq experiment using Tophat2 as the aligner and will take a long time to finish on a single machine. At the end it will output counts, Cufflinks quantitation and a set of QC results about each lane. If you have a cluster you can \fI\%parallelize it\fP to speed it up considerably. .sp A nice\-looking standalone \fI\%report\fP of the bcbio\-nextgen run can be generated using \fI\%bcbio.rnaseq\fP\&. Check that repository for details. .SS Human genome build 38 .sp Validate variant calling on human genome build 38, using two different builds (with and without alternative alleles) and three different validation datasets (Genome in a Bottle prepared with two methods and Illumina platinum genomes). To run: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C mkdir \-p NA12878\-hg38\-val cd NA12878\-hg38\-val wget https://raw.github.com/bcbio/bcbio\-nextgen/master/config/examples/NA12878\-hg38\-validate\-getdata.sh bash NA12878\-hg38\-validate\-getdata.sh mkdir \-p work cd work bcbio_nextgen.py ../config/NA12878\-hg38\-validate.yaml \-n 16 .ft P .fi .UNINDENT .UNINDENT .SS Whole genome (10x) .sp An input configuration for running whole genome variant calling with bwa and GATK, using Illumina\(aqs \fI\%Platinum genomes project\fP (\fI\%NA12878\-illumina.yaml\fP). See this \fI\%blog post on whole genome scaling\fP for expected run times and more information about the pipeline.
To run the analysis: .INDENT 0.0 .IP \(bu 2 Create an input directory structure like: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C ├── config │\ \ └── NA12878\-illumina.yaml ├── input └── work .ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 Retrieve inputs and comparison calls: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C cd input wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_1.fastq.gz wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR091/ERR091571/ERR091571_2.fastq.gz .ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 Retrieve configuration input file: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C cd config wget https://raw.github.com/bcbio/bcbio\-nextgen/master/config/examples/NA12878\-illumina.yaml .ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 Run analysis on 16 core machine: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C cd work bcbio_nextgen.py ../config/NA12878\-illumina.yaml \-n 16 .ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 Examine the summary of concordance and discordance with the comparison calls in the \fBgrading\-summary.csv\fP file in the work directory. .UNINDENT .SH TEST SUITE .sp The test suite exercises the scripts driving the analysis, so it is a good starting point for ensuring correct installation. Tests use the \fI\%pytest\fP framework. The tests are available in the bcbio source code: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ git clone https://github.com/bcbio/bcbio\-nextgen.git .ft P .fi .UNINDENT .UNINDENT .sp There is a small wrapper script that finds py.test and other dependencies pre\-installed with bcbio; use it to run the tests: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ cd tests $ ./run_tests.sh .ft P .fi .UNINDENT .UNINDENT .sp You can use this to run specific test targets: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ ./run_tests.sh cancer $ ./run_tests.sh rnaseq $ ./run_tests.sh devel $ ./run_tests.sh docker .ft P .fi .UNINDENT .UNINDENT .sp Optionally, you can run pytest directly from the bcbio install to tweak more options. It will be in \fB/path/to/bcbio/anaconda/bin/py.test\fP\&. Pass \fB\-s\fP to \fBpy.test\fP to see the stdout log, and \fB\-v\fP to make py.test output more verbose. The tests are marked with labels which you can use to run a specific subset of the tests using the \fB\-m\fP argument: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ py.test \-m rnaseq .ft P .fi .UNINDENT .UNINDENT .sp To run unit tests: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ py.test tests/unit .ft P .fi .UNINDENT .UNINDENT .sp To run integration pipeline tests: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ py.test tests/integration .ft P .fi .UNINDENT .UNINDENT .sp To run tests which use bcbio_vm: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C $ py.test tests/bcbio_vm .ft P .fi .UNINDENT .UNINDENT .sp To see the test coverage, add the \fB\-\-cov=bcbio\fP argument to \fBpy.test\fP\&. .sp By default the test suite will use your installed system configuration for running tests, substituting the test genome information instead of using full genomes. If you need a specific testing environment, copy \fBtests/data/automated/post_process\-sample.yaml\fP to \fBtests/data/automated/post_process.yaml\fP to provide a test\-only configuration. .sp Two configuration files, in easy\-to\-write \fI\%YAML format\fP, specify details about your system and samples to run: .INDENT 0.0 .IP \(bu 2 \fBbcbio_sample.yaml\fP Details about a set of samples to process, including input files and analysis options. You configure these for each set of samples to process.
This will be the main file prepared for each sample run and the documentation below details techniques to help prepare them. .IP \(bu 2 \fBbcbio_system.yaml\fP High level information about the system, including locations of installed programs like GATK, as well as core and memory usage (see tuning\-cores). These apply across multiple runs. The automated installer creates a ready\-to\-go system configuration file that can be manually edited to match the system. Find the file in the galaxy sub\-directory within your installation data location (e.g. \fB/usr/local/share/bcbio\-nextgen/galaxy\fP). To modify system parameters for a specific run, supply \fI\%Sample or run specific resources\fP in your \fBbcbio_sample.yaml\fP file. .UNINDENT .sp Commented \fI\%system\fP and \fI\%sample\fP example files are available in the \fBconfig\fP directory. The example\-pipelines section contains additional examples of ready\-to\-run sample files. .SH AUTOMATED SAMPLE CONFIGURATION .sp bcbio\-nextgen provides a utility to create configuration files for multiple sample inputs using a base template. Start with one of the \fI\%best\-practice templates\fP, or define your own, then apply to multiple samples using the template workflow command: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py \-w template freebayes\-variant project1.csv sample1.bam sample2_1.fq sample2_2.fq .ft P .fi .UNINDENT .UNINDENT .INDENT 0.0 .IP \(bu 2 \fBfreebayes\-variant\fP is the name of the standard \fBfreebayes\-variant.yaml\fP input, which the script fetches from GitHub. This argument can also be a path to a locally customized YAML configuration. In both cases, the script replicates the single sample template configuration to all input samples. .IP \(bu 2 \fBproject1.csv\fP is a comma separated value file containing sample metadata, descriptions and algorithm tweaks: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C samplename,description,batch,phenotype,sex,variant_regions sample1,ERR256785,batch1,normal,female,/path/to/regions.bed sample2,ERR256786,batch1,tumor,,/path/to/regions.bed .ft P .fi .UNINDENT .UNINDENT .sp The first column links the metadata to a specific input file. The template command tries to identify the \fBsamplename\fP from read group information in a BAM file, or uses the base filename if no read group information is present. For BAM files, this would be the filename without the extension and path (\fB/path/to/yourfile.bam => yourfile\fP). For FASTQ files, the template functionality will identify pairs using standard conventions (\fB_1\fP and \fB_2\fP, including Illumina extensions like \fB_R1\fP), so use the base filename without these (\fB/path/to/yourfile_R1.fastq => yourfile\fP). Note that paired\-end samples sequentially numbered without leading zeros (e.g. \fBsample_1_1.fastq\fP, \fBsample_1_2.fastq\fP, \fBsample_2_1.fastq\fP, \fBsample_2_2.fastq\fP) will likely not be parsed correctly; see \fI\%#1919\fP for more info. In addition, \fB\&.\fP characters can be problematic, so it is better to avoid them except as the separator for the file extension. .INDENT 2.0 .INDENT 3.5 .INDENT 0.0 .INDENT 3.5 The remaining columns can contain: .UNINDENT .UNINDENT .INDENT 0.0 .IP \(bu 2 \fBdescription\fP Changes the sample description, originally supplied by the file name or BAM read group, to this value. You can also set the \fBlane\fP, although this is less often done as the default sequential numbering works here. See the documentation for \fI\%Sample information\fP on how these map to BAM read groups.
.IP \(bu 2 Algorithm parameters specific for this sample. If the column name matches one of the available \fI\%Algorithm parameters\fP, then this value substitutes into the sample \fBalgorithm\fP, replacing the defaults from the template. You can also change other information in the BAM read group through the \fBalgorithm\fP parameters. See \fI\%Alignment\fP configuration documentation for details on how these map to read group information. .IP \(bu 2 \fI\%Sample information\fP metadata key/value pairs. Any columns not falling into the above cases will go into the metadata section. A \fBped\fP specification will allow bcbio to read family, gender and phenotype information from a PED input file and use it for batch, sex and phenotype, respectively. The PED inputs supplement information from the standard template file, so if you specify a value in the template CSV the PED information will not overwrite it. Alternatively, \fBped\fP fields can be specified directly in the metadata as columns. If \fBfamily_id\fP is specified it will be used as the \fBfamily_id\fP for that sample, otherwise \fBbatch\fP will be used. The \fBdescription\fP column is used as the \fBindividual_id\fP column and the \fBphenotype\fP column will be used as the \fBaffected\fP column in the PED format: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C samplename,description,phenotype,batch,sex,ethnicity,maternal_id,paternal_id,family_id NA12878.bam,NA12878,\-9,CEPH,female,\-9,NA12892,NA12891,NA12878FAM .ft P .fi .UNINDENT .UNINDENT .UNINDENT .UNINDENT .UNINDENT .sp Individual column items can contain booleans (true or false), integers, or lists (separated by semi\-colons). These get converted into the expected type in the output YAML file. For instance, to specify a sample that should go into multiple batches: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C samplename,description,phenotype,batch normal.bam,two_normal,normal,Batch1;Batch2 .ft P .fi .UNINDENT .UNINDENT .sp For dictionary inputs like somatic\-w\-germline\-variants setups, you can separate items in a dictionary with colons and double colons, and also use semicolons for lists: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C samplename,description,phenotype,variantcaller tumor.bam,sample1,tumor,germline:freebayes;gatk\-haplotype::somatic:vardict;freebayes .ft P .fi .UNINDENT .UNINDENT .sp The name of the metadata file, minus the \fB\&.csv\fP extension, is a short name identifying the current project. The script creates a \fBproject1\fP directory containing the sample configuration in \fBproject1/config/project1.yaml\fP\&. .IP \(bu 2 The remaining arguments are input BAM or FASTQ files. The script pairs FASTQ files (identified by \fB_1\fP and \fB_2\fP) and extracts sample names from input BAMs, populating the \fBfiles\fP and \fBdescription\fP fields in the final configuration file. Specify the full path to sample files on your current machine. .UNINDENT .sp To make it easier to define your own project specific template, an optional first step is to download and edit a local template. First retrieve a standard template: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py \-w template freebayes\-variant project1 .ft P .fi .UNINDENT .UNINDENT .sp This pulls the current \fBfreebayes\-variant\fP calling template into your project directory in \fBproject1/config/project1\-template.yaml\fP\&.
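.sp
For illustration, a locally customized template keeps the same structure as the standard ones and typically adjusts the \fBalgorithm\fP section. A minimal sketch, where the specific option values are example choices drawn from the documented parameters rather than template defaults:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
details:
  \- analysis: variant2
    genome_build: GRCh37
    algorithm:
      aligner: bwa
      variantcaller: [freebayes, gatk\-haplotype]
      mark_duplicates: true
upload:
  dir: ../final
.ft P
.fi
.UNINDENT
.UNINDENT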
Manually edit this template file to define your options, then run the full template creation for your samples, pointing to this custom configuration file: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py \-w template project1/config/project1\-template.yaml project1.csv folder/* .ft P .fi .UNINDENT .UNINDENT .sp If your sample folder contains additional BAM or FASTQ files you do not wish to include in the sample YAML configuration, you can restrict the output to only include samples in the metadata CSV with \fB\-\-only\-metadata\fP\&. The output will print warnings about samples not present in the metadata file, then leave these out of the final output YAML: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py \-w template \-\-only\-metadata project1/config/project1\-template.yaml project1.csv folder/* .ft P .fi .UNINDENT .UNINDENT .SH MULTIPLE FILES PER SAMPLE .sp If you have multiple FASTQ or BAM files for each sample you can use \fBbcbio_prepare_samples.py\fP\&. The main parameters are: .INDENT 0.0 .IP \(bu 2 \fB\-\-out\fP: the folder where the merged files will be written .IP \(bu 2 \fB\-\-csv\fP: a CSV file in exactly the same format as described previously, but with one duplicated line per file to merge for each sample: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C samplename,description,batch,phenotype,sex,variant_regions file1.fastq,sample1,batch1,normal,female,/path/to/regions.bed file2.fastq,sample1,batch1,normal,female,/path/to/regions.bed file1.fastq,sample2,batch1,tumor,,/path/to/regions.bed .ft P .fi .UNINDENT .UNINDENT .UNINDENT .sp An example of usage is: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_prepare_samples.py \-\-out merged \-\-csv project1.csv .ft P .fi .UNINDENT .UNINDENT .sp The script will create \fBsample1.fastq\fP and \fBsample2.fastq\fP in the \fBmerged\fP folder, and a new CSV file in the same folder as the input CSV: \fBproject1\-merged.csv\fP\&. This can then be used with bcbio: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_nextgen.py \-w template project1/config/project1\-template.yaml project1\-merged.csv merged/*fastq .ft P .fi .UNINDENT .UNINDENT .sp The new CSV file will look like: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C samplename,description,batch,phenotype,sex,variant_regions sample1.fastq,sample1,batch1,normal,female,/path/to/regions.bed sample2.fastq,sample2,batch1,tumor,,/path/to/regions.bed .ft P .fi .UNINDENT .UNINDENT .sp It supports parallelization the same way \fBbcbio_nextgen.py\fP does: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C python $BCBIO_PATH/scripts/utils/bcbio_prepare_samples.py \-\-out merged \-\-csv project1.csv \-t ipython \-q queue_name \-s lsf \-n 1 .ft P .fi .UNINDENT .UNINDENT .sp See more examples at \fI\%parallelize pipeline\fP\&. .sp For paired reads, the CSV file should contain all files: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C samplename,description,batch,phenotype,sex,variant_regions file1_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed file2_R1.fastq,sample1,batch1,normal,female,/path/to/regions.bed file1_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed file2_R2.fastq,sample1,batch1,normal,female,/path/to/regions.bed .ft P .fi .UNINDENT .UNINDENT .sp The script will try to guess the paired files the same way that \fBbcbio_nextgen.py \-w template\fP does.
It detects paired files when the only difference between two filenames is \fB_R1/_R2\fP, \fB\-1/\-2\fP, \fB_1/_2\fP or \fB\&.1/.2\fP\&. .sp The output CSV will look like the following and is compatible with bcbio: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C samplename,description,batch,phenotype,sex,variant_regions sample1,sample1,batch1,normal,female,/path/to/regions.bed .ft P .fi .UNINDENT .UNINDENT .SH SAMPLE INFORMATION .sp The sample configuration file defines \fBdetails\fP of each sample to process: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C details: \- analysis: variant2 lane: 1 description: Example1 files: [in_pair_1.fq, in_pair_2.fq] genome_build: hg19 algorithm: platform: illumina metadata: batch: Batch1 sex: female platform_unit: flowcell\-barcode.lane library: library_type .ft P .fi .UNINDENT .UNINDENT .INDENT 0.0 .IP \(bu 2 \fBanalysis\fP Analysis method to use [variant2, RNA\-seq, smallRNA\-seq] .IP \(bu 2 \fBlane\fP A unique number within the project. Corresponds to the \fBID\fP parameter in the BAM read group. .IP \(bu 2 \fBdescription\fP Unique name for this sample, corresponding to the \fBSM\fP parameter in the BAM read group. Required. .IP \(bu 2 \fBfiles\fP A list of files to process. This currently supports either a single\-end FASTQ file, two paired\-end FASTQ files, or a single BAM file. It does not yet handle merging BAM files or more complicated inputs. .IP \(bu 2 \fBgenome_build\fP Genome build to align to, which references a genome keyword in Galaxy to find location build files. .IP \(bu 2 \fBalgorithm\fP Parameters to configure algorithm inputs. Options described in more detail below: .INDENT 2.0 .IP \(bu 2 \fBplatform\fP Sequencing platform used. Corresponds to the \fBPL\fP parameter in BAM read groups. Optional, defaults to \fBillumina\fP\&. .UNINDENT .IP \(bu 2 \fBmetadata\fP Additional descriptive metadata about the sample: .INDENT 2.0 .INDENT 3.5 .INDENT 0.0 .IP \(bu 2 \fBbatch\fP defines a group that the sample falls in. We perform multi\-sample variant calling on all samples with the same batch name. This can also be a list, allowing specification of a single normal sample to pair with multiple tumor samples in paired cancer variant calling (\fBbatch: [MatchWithTumor1, MatchWithTumor2]\fP). .IP \(bu 2 \fBsex\fP specifies the sample gender used to correctly prepare X/Y chromosomes. Use \fBmale\fP and \fBfemale\fP or PED style inputs (1=male, 2=female). .IP \(bu 2 \fBphenotype\fP stratifies cancer samples into \fBtumor\fP and \fBnormal\fP or case/controls into \fBaffected\fP and \fBunaffected\fP\&. Also accepts PED style specifications (1=unaffected, 2=affected). CNVkit uses case/control status to determine how to set background samples for CNV calling. .IP \(bu 2 \fBprep_method\fP A free text description of the method used in sample prep. Used to group together samples during CNV calling for background. This is not required and when not present bcbio assumes all samples in an analysis use the same method. .IP \(bu 2 \fBsvclass\fP defines a classification for a sample for use in SV case/control setups. When set as \fBcontrol\fP, it will put samples into the background samples used for normalization. .IP \(bu 2 \fBped\fP provides a \fI\%PED phenotype file\fP containing sample phenotype and family information. Template creation uses this to supplement \fBbatch\fP, \fBsex\fP and \fBphenotype\fP information provided in the template CSV. GEMINI database creation uses the PED file as input. .IP \(bu 2 \fBplatform_unit\fP \-\- Unique identifier for sample.
Optional, defaults to \fBlane\fP if not specified. .IP \(bu 2 \fBlibrary\fP \-\- Name of library preparation used. Optional, empty if not present. .IP \(bu 2 \fBvalidate_batch\fP \-\- Specify a batch name to group samples together for preparing validation plots. This is useful if you want to process samples in specific batches, but include multiple batches into the same validation plot. .IP \(bu 2 \fBvalidate_combine\fP \-\- Specify a batch name to combine multiple samples into an additional validation summary. Useful for larger numbers of small samples to evaluate together. .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SH UPLOAD .sp The \fBupload\fP section of the sample configuration file describes where to put the final output files of the pipeline. At its simplest, you can configure bcbio\-nextgen to upload results to a local directory, for example a folder shared amongst collaborators or a Dropbox account. You can also configure it to upload results automatically to a Galaxy instance, to \fI\%Amazon S3\fP or to iRODS. Here is the simplest configuration, uploading to a local directory: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C upload: dir: /local/filesystem/directory .ft P .fi .UNINDENT .UNINDENT .sp General parameters, always required: .INDENT 0.0 .IP \(bu 2 \fBmethod\fP Upload method to employ. Defaults to local filesystem. [filesystem, galaxy, s3, irods] .IP \(bu 2 \fBdir\fP Local filesystem directory to copy to. .UNINDENT .sp Galaxy parameters: .INDENT 0.0 .IP \(bu 2 \fBgalaxy_url\fP URL of the Galaxy instance to upload to. Upload assumes you are able to access a shared directory also present on the Galaxy machine. .IP \(bu 2 \fBgalaxy_api_key\fP User API key to access Galaxy: see the \fI\%Galaxy API\fP documentation. .IP \(bu 2 \fBgalaxy_library\fP Name of the Galaxy Data Library to upload to. You can specify this globally for a project in \fBupload\fP or for individual samples in the sample details section. .IP \(bu 2 \fBgalaxy_role\fP Specific Galaxy access roles to assign to the uploaded datasets. This is optional and will default to the access of the parent data library if not supplied. You can specify this globally for a project in \fBupload\fP or for individual samples in the sample details section. The \fI\%Galaxy Admin\fP documentation has more details about roles. .UNINDENT .sp Here is an example configuration for uploading to a Galaxy instance. This assumes you have a shared mounted filesystem that your Galaxy instance can also access: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C upload: method: galaxy dir: /path/to/shared/galaxy/filesystem/folder galaxy_url: http://url\-to\-galaxy\-instance galaxy_api_key: YOURAPIKEY galaxy_library: data_library_to_upload_to .ft P .fi .UNINDENT .UNINDENT .sp Your Galaxy universe_wsgi.ini configuration needs to have \fBallow_library_path_paste = True\fP set to enable uploads. .sp S3 parameters: .INDENT 0.0 .IP \(bu 2 \fBbucket\fP AWS bucket to direct output. .IP \(bu 2 \fBfolder\fP A folder path within the AWS bucket to prefix the output. .IP \(bu 2 \fBregion\fP AWS region name to use. Defaults to us\-east\-1 .IP \(bu 2 \fBreduced_redundancy\fP Flag to determine if we should store S3 data with reduced redundancy: cheaper but less reliable [false, true] .UNINDENT .sp For S3 access credentials, set the standard environmental variables, \fBAWS_ACCESS_KEY_ID\fP, \fBAWS_SECRET_ACCESS_KEY\fP, and \fBAWS_DEFAULT_REGION\fP or use \fI\%IAM access roles\fP with an instance profile on EC2 to give your instances permission to create temporary S3 access. 
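.sp
As a sketch, an S3 upload configuration combining these parameters might look like the following, where the bucket and folder names are placeholders:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
upload:
  method: s3
  bucket: your\-analysis\-bucket
  folder: project1/final
  region: us\-east\-1
.ft P
.fi
.UNINDENT
.UNINDENT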
.sp iRODS parameters: .INDENT 0.0 .IP \(bu 2 \fBfolder\fP Full directory name within iRODS to prefix the output. .IP \(bu 2 \fBresource\fP (optional) iRODS resource name, if other than default. .UNINDENT .sp Example configuration: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C upload: method: irods dir: ../final folder: /irodsZone/your/path/ resource: yourResourceName .ft P .fi .UNINDENT .UNINDENT .sp Uploads to iRODS depend on a valid installation of the iCommands CLI, and a preconfigured connection through the \fIiinit\fP command. .SH GLOBALS .sp You can define files used multiple times in the \fBalgorithm\fP section of your configuration in a top level \fBglobals\fP dictionary. This saves copying and pasting across the configuration and makes it easier to manually adjust the configuration if inputs change: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C globals: my_custom_locations: /path/to/file.bed details: \- description: sample1 algorithm: variant_regions: my_custom_locations \- description: sample2 algorithm: variant_regions: my_custom_locations .ft P .fi .UNINDENT .UNINDENT .SH ALGORITHM PARAMETERS .sp The YAML configuration file provides a number of hooks to customize analysis in the sample configuration file. Place these under the \fBalgorithm\fP keyword. .SS Alignment .INDENT 0.0 .IP \(bu 2 \fBplatform\fP Sequencing platform used. Corresponds to the \fBPL\fP parameter in BAM read groups. Default \(aqillumina\(aq. .IP \(bu 2 \fBaligner\fP Aligner to use: [bwa, bowtie, bowtie2, hisat2, minimap2, novoalign, snap, star, tophat2, false] To use pre\-aligned BAM files as inputs to the pipeline, set to \fBfalse\fP, which will also skip duplicate marking by default. Using pre\-aligned inputs requires proper assignment of BAM read groups and sorting. The \fBbam_clean\fP argument can often resolve issues with problematic input BAMs. .IP \(bu 2 \fBbam_clean\fP Clean an input BAM when skipping alignment step. This handles adding read groups, sorting to a reference genome and filtering problematic records that cause errors with GATK. Options: .INDENT 2.0 .INDENT 3.5 .INDENT 0.0 .IP \(bu 2 \fBremove_extracontigs\fP \-\- Remove non\-standard chromosomes (for human, anything that is not chr1\-22,X,Y) from the BAM file. This allows compatibility when the BAM reference genome has different contigs from the reference file but consistent ordering for standard chromosomes. Also fixes the read groups in the BAM file as in \fBfixrg\fP\&. This is faster than the full \fBpicard\fP cleaning option. .IP \(bu 2 \fBfixrg\fP \-\- only adjusts read groups, assuming everything else in the BAM file is compatible. .IP \(bu 2 \fBpicard\fP \-\- Picard/GATK based cleaning. Includes read group changes, fixing of problematic reads and re\-ordering chromosome order to match the reference genome. To fix misencoded input BAMs with non\-standard scores, set \fBquality_format\fP to \fBillumina\fP\&. .UNINDENT .UNINDENT .UNINDENT .IP \(bu 2 \fBbam_sort\fP Allow sorting of input BAMs when skipping alignment step (\fBaligner\fP set to false). Options are coordinate or queryname. Additional processing through standard pipelines requires coordinate\-sorted inputs. The default is to not do additional sorting and assume pre\-sorted BAMs. .IP \(bu 2 \fBdisambiguate\fP For mixed or explant samples, provide a list of \fBgenome_build\fP identifiers to check and remove from alignment. Currently supports cleaning a single organism.
For example, with \fBgenome_build: hg19\fP and \fBdisambiguate: [mm10]\fP, it will align to hg19 and mm10, run disambiguation and discard reads confidently aligned to mm10 and not hg19. Affects fusion detection when \fBstar\fP is chosen as the aligner. The aligner must be set to a non\-false value for this to run. .IP \(bu 2 \fBalign_split_size\fP: Increase parallelization of alignment. As of 0.9.8, bcbio will try to determine a useful parameter and you don\(aqt need to set this. If you manually set it, bcbio will respect your specification. Set to false to avoid splitting entirely. If set, this defines the number of records to feed into each independent parallel step (for example, 5000000 = 5 million reads per chunk). It converts the original inputs into bgzip grabix indexed FASTQ files, and then retrieves chunks for parallel alignment. Following alignment, it combines all chunks back into the final merged alignment file. This allows parallelization at the cost of additional work of preparing inputs and combining split outputs. The tradeoff makes sense when you have large files and lots of distributed compute. If you instead have a few large multicore machines, this parameter may not help speed up processing. .IP \(bu 2 \fBquality_format\fP Quality format of FASTQ or BAM inputs [standard, illumina] .IP \(bu 2 \fBstrandedness\fP For RNA\-seq libraries, if your library is strand specific, set the appropriate flag from [unstranded, firststrand, secondstrand]. Defaults to unstranded. For dUTP marked libraries, firststrand is correct; for Scriptseq prepared libraries, secondstrand is correct. .IP \(bu 2 \fBsave_diskspace\fP Remove align prepped bgzip and split BAM files after merging into final BAMs. Helps reduce space on limited filesystems during a run. \fBtools_off: [upload_alignment]\fP may also be useful in conjunction with this. [false, true] .UNINDENT .SS Read trimming .INDENT 0.0 .IP \(bu 2 \fBtrim_reads\fP Trims low quality bases or adapter sequences at the ends of reads. \fBadapters\fP and \fBcustom_trim\fP specify the sequences to trim. For RNA\-seq, it is recommended to leave this as False unless running Tophat2. For variant calling, we recommend trimming only in special cases where standard soft\-clipping does not resolve false positive problems. Supports trimming with \fI\%atropos\fP or \fI\%fastp\fP\&. \fBfastp\fP is currently not compatible with alignment splitting in variant calling and requires \fBalign_split_size: false\fP\&. The old \fBread_through\fP value maps to atropos trimming. [False, atropos, fastp]. Defaults to False. .IP \(bu 2 \fBadapters\fP If trimming adapter read through, trim a set of stock adapter sequences. Allows specification of multiple items in a list, for example [truseq, polya] will trim both TruSeq adapter sequences and polyA tails. polyg trimming removes high quality G stretches present in NovaSeq and NextSeq data. In the small RNA pipeline, bcbio will try to detect the adapter using DNApi. If you set this parameter, bcbio will use your value instead. Choices: [truseq, illumina, nextera, polya, polyx, polyg, nextera2, truseq2]. .INDENT 2.0 .IP \(bu 2 nextera2: Illumina NEXTera DNA prep kit from NEB .IP \(bu 2 truseq2: SMARTer Universal Low Input RNA Kit .UNINDENT .IP \(bu 2 \fBcustom_trim\fP A list of sequences to trim from the end of reads, for example: [AAAATTTT, GGGGCCCC] .IP \(bu 2 \fBmin_read_length\fP Minimum read length to maintain when \fBread_through\fP trimming is set in \fBtrim_reads\fP\&. Defaults to 25.
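.sp
Putting the trimming options above together, a minimal sketch of an \fBalgorithm\fP section enabling adapter trimming; the values shown are illustrative choices from the documented options, not defaults:
.INDENT 2.0
.INDENT 3.5
.sp
.nf
.ft C
algorithm:
  trim_reads: atropos
  adapters: [truseq, polya]
  min_read_length: 25
.ft P
.fi
.UNINDENT
.UNINDENT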
.IP \(bu 2 \fBtrim_ends\fP Specify values for trimming at ends of reads, using a fast approach built into fastq preparation. This does not do quality or adapter trimming but will quickly cut off a defined set of values from either the 5\(aq or 3\(aq end of the first and second reads. Expects a list of 4 values: [5\(aq trim read1, 3\(aq trim read1, 5\(aq trim read2, 3\(aq trim read2]. Set values to 0 if you don\(aqt need trimming (e.g. \fB[6, 0, 12, 0]\fP will trim 6bp from the start of read 1 and 12bp from the start of read 2). Only implemented for variant calling pipelines. .UNINDENT .SS Alignment postprocessing .INDENT 0.0 .IP \(bu 2 \fBmark_duplicates\fP Mark duplicated reads [true, false]. If true, will perform streaming duplicate marking with \fI\%biobambam\(aqs bammarkduplicates or bamsormadup\fP\&. Uses \fI\%samblaster\fP as an alternative if you have paired reads and specify \fBlumpy\fP as an \fBsvcaller\fP\&. Defaults to true for variant calling and false for RNA\-seq and small RNA analyses. Also defaults to false if you\(aqre not doing alignment (\fBaligner: false\fP). .IP \(bu 2 \fBrecalibrate\fP Perform base quality score recalibration on the aligned BAM file, adjusting quality scores to reflect alignments and known variants. Supports both GATK and Sentieon recalibration. Defaults to false, no recalibration. [false, gatk, sentieon] .IP \(bu 2 \fBrealign\fP Perform GATK\(aqs realignment around indels on the aligned BAM file. Defaults to no realignment since realigning callers like FreeBayes and GATK HaplotypeCaller handle this as part of the calling process. [false, gatk] .UNINDENT .SS Coverage information .INDENT 0.0 .IP \(bu 2 \fBcoverage_interval\fP Regions covered by sequencing. bcbio calculates this automatically from alignment coverage information, so you only need to specify it in the input configuration if you have specific needs or bcbio does not determine coverage correctly. \fBgenome\fP specifies full genome sequencing, \fBregional\fP identifies partial\-genome pull down sequencing like exome analyses, and \fBamplicon\fP is partial\-genome sequencing from PCR amplicon sequencing. This influences GATK options for filtering: we use Variant Quality Score Recalibration when set to \fBgenome\fP, otherwise we apply cutoff\-based soft filters. Also affects copy number calling with CNVkit, structural variant calling and deep panel calling in cancer samples, where we tune regional/amplicon analyses to maximize sensitivity. [genome, regional, amplicon] .IP \(bu 2 \fBmaxcov_downsample\fP bcbio downsamples whole genome runs with >10x average coverage to a maximum coverage, avoiding slow runtimes in collapsed repeats and poly\-A/T/G/C regions. This parameter specifies the multiplier of average coverage at which to downsample. For example, \fI200\fP downsamples at 6000x coverage for a 30x whole genome. Set to \fIfalse\fP or \fI0\fP to disable downsampling. Currently defaults to \fIfalse\fP pending runtime improvements. .IP \(bu 2 \fBcoverage_depth_min\fP Minimum depth of coverage. When calculating regions to call in, bcbio may exclude regions with less than this many reads. It is not a hard filter for variant calling, but rather a guideline for determining callable regions. It\(aqs primarily useful when trying to call on very low depth samples. Defaults to 4. Setting lower than 4 will trigger low\-depth calling options for GATK. .UNINDENT .SS Analysis regions .sp These BED files define the regions of the genome to analyze and report on.
\fBvariant_regions\fP adjusts regions for small variant (SNP and indel) calling. \fBsv_regions\fP defines regions for structural variant calling if different from \fBvariant_regions\fP\&. For coverage\-based quality control metrics, we first use \fBcoverage\fP if specified, then \fBsv_regions\fP if specified, then \fBvariant_regions\fP\&. See the section on \fI\%Input file preparation\fP for tips on ensuring chromosome naming in these files matches your reference genome. bcbio pre\-installs some standard BED files for human analyses. Reference these using the naming schemes described in the \fI\%reference data repository\fP\&. .INDENT 0.0 .IP \(bu 2 \fBvariant_regions\fP BED file of regions to call variants in. .IP \(bu 2 \fBsv_regions\fP \-\- A specification of regions to target during structural variant calling. By default, bcbio uses regions specified in \fBvariant_regions\fP but this allows custom specification for structural variant calling. This can be a pointer to a BED file or special inputs: \fBexons\fP for only exon regions, \fBtranscripts\fP for transcript regions (the min start and max end of exons) or \fBtranscriptsXXXX\fP for transcripts plus a window of XXXX size around it. The size can be an integer (\fBtranscripts1000\fP) or exponential (\fBtranscripts1e5\fP). This applies to CNVkit and heterogeneity analysis. .IP \(bu 2 \fBcoverage\fP A BED file of regions to check for coverage and completeness in QC reporting. This can also be a shorthand for a BED file installed by bcbio (see \fI\%Structural variant calling\fP for options). .IP \(bu 2 \fBexclude_regions\fP List of regions to remove as part of analysis. This allows avoidance of slow and potentially misleading regions. This is a list of the following options: .INDENT 2.0 .INDENT 3.5 .INDENT 0.0 .IP \(bu 2 \fBpolyx\fP Avoid calling variants in regions of single nucleotide stretches greater than 50 bases. These can contribute to long variant calling runtimes when errors in polyX stretches align at high depth in these regions and take a lot of work to resolve. Since we don\(aqt expect decent resolution through these types of repeats, this helps avoid extra calculations for assessing the noise. This is an alternative to trimming polyX from the 3\(aq ends for reads with \fBtrim_reads\fP and \fBadapters\fP\&. Requires an organism with a defined \fBpolyx\fP file in genome resources. For structural variant calling, adding \fBpolyx\fP avoids calling small indels for Manta, where these can contribute to long runtimes. .IP \(bu 2 \fBlcr\fP Avoid calling variants in low complexity regions (LCRs). \fI\%Heng Li\(aqs variant artifacts paper\fP provides these regions, which cover ~2% of the genome but contribute to a large fraction of problematic calls due to the difficulty of resolving variants in repetitive regions. Removal can help facilitate comparisons between methods and reduce false positives if you don\(aqt need calls in LCRs for your biological analysis. Requires an organism with a defined \fBlcr\fP file in genome resources. .IP \(bu 2 \fBhighdepth\fP Remove high depth regions during variant calling, identified by collapsed repeats around centromeres in hg19 and GRCh37 as characterized in the \fI\%ENCODE blacklist\fP\&. This is on by default for VarDict and FreeBayes whole genome calling to help with slow runtimes in these regions, and also on for whole genome structural variant calling to avoid false positives from high depth repeats. .IP \(bu 2 \fBaltcontigs\fP Skip calling entirely in alternative and unplaced contigs.
This limits analysis to standard chromosomes \-\- chr1\-22,X,Y,MT for human \-\- to avoid slowdowns on the additional contigs. .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS Variant calling .INDENT 0.0 .IP \(bu 2 \fBvariantcaller\fP Variant calling algorithm. Can be a list of multiple options or false to skip [false, freebayes, gatk\-haplotype, haplotyper, platypus, mutect, mutect2, scalpel, tnhaplotyper, tnscope, vardict, varscan, samtools, gatk] .INDENT 2.0 .INDENT 3.5 .INDENT 0.0 .IP \(bu 2 Paired (typically somatic, tumor\-normal) variant calling is currently supported by vardict, freebayes, mutect2, mutect (see disclaimer below), scalpel (indels only), tnhaplotyper (Sentieon), tnscope (Sentieon) and varscan. See the pipeline documentation on cancer\-calling for details on pairing tumor and normal samples. .IP \(bu 2 You can generate both somatic and germline calls for paired tumor\-normal samples using different sets of callers. The pipeline documentation on calling somatic\-w\-germline\-variants details how to do this. .IP \(bu 2 mutect, a SNP\-only caller, can be combined with indels from scalpel or sid. Mutect operates in both tumor\-normal and tumor\-only modes. In tumor\-only mode the indels from scalpel will reflect all indels in the sample, as there is currently no way of separating germline from somatic indels in that mode. .UNINDENT .UNINDENT .UNINDENT .IP \(bu 2 .INDENT 2.0 .TP .B \fBindelcaller\fP For the MuTect SNP\-only variant caller it is possible to add calls from an indelcaller such as scalpel, pindel and somatic indel detector (for Appistry MuTect users only). Currently an experimental option that adds these indel calls to MuTect\(aqs SNP\-only output. Only one caller supported. Omit to ignore. [scalpel, pindel, sid, false] .UNINDENT .IP \(bu 2 \fBjointcaller\fP Joint calling algorithm, combining variants called with the specified \fBvariantcaller\fP\&. Can be a list of multiple options but needs to match the appropriate \fBvariantcaller\fP\&. Joint calling is only needed for larger input sample sizes (>100 samples), otherwise use standard pooled population\-calling: .INDENT 2.0 .INDENT 3.5 .INDENT 0.0 .IP \(bu 2 \fBgatk\-haplotype\-joint\fP \fI\%GATK incremental joint discovery\fP with HaplotypeCaller. Takes individual gVCFs called by \fBgatk\-haplotype\fP and performs combined genotyping. .IP \(bu 2 \fBfreebayes\-joint\fP Combine freebayes calls using \fI\%bcbio.variation.recall\fP with recalling at all positions found in each individual sample. Requires \fBfreebayes\fP variant calling. .IP \(bu 2 \fBplatypus\-joint\fP Combine platypus calls using bcbio.variation.recall with squaring off at all positions found in each individual sample. Requires \fBplatypus\fP variant calling. .IP \(bu 2 \fBsamtools\-joint\fP Combine samtools calls using bcbio.variation.recall with squaring off at all positions found in each individual sample. Requires \fBsamtools\fP variant calling. .UNINDENT .UNINDENT .UNINDENT .IP \(bu 2 \fBjoint_group_size\fP Specify the maximum number of gVCF samples to feed into joint calling. Currently applies to GATK HaplotypeCaller joint calling and defaults to the GATK recommendation of 200. Larger numbers of samples will first get combined prior to genotyping. .IP \(bu 2 \fBploidy\fP Ploidy of called reads. Defaults to 2 (diploid). You can also tweak specialty ploidy like mitochondrial calling by setting ploidy as a dictionary.
The defaults are: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C ploidy: default: 2 mitochondrial: 1 female: 2 male: 1 .ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 \fBbackground\fP Provide pre\-calculated files to use as backgrounds for different processes. Organized as a dictionary with individual keys for different components of the pipeline. You can enter as many or few as needed: .INDENT 2.0 .INDENT 3.5 .INDENT 0.0 .IP \(bu 2 \fBvariant\fP A VCF file with variants to use as a background reference during variant calling. For tumor/normal paired calling use this to supply a panel of normal individuals. .IP \(bu 2 \fBcnv_reference\fP \fI\%Background reference file\fP for copy number calling. .UNINDENT .UNINDENT .UNINDENT .UNINDENT .SS Somatic variant calling .INDENT 0.0 .IP \(bu 2 \fBmin_allele_fraction\fP Minimum allele fraction to detect variants in heterogeneous tumor samples, set as the float or integer percentage to resolve (i.e. 10 = alleles in 10% of the sample). Defaults to 10. Specify this in the tumor sample of a tumor/normal pair. .UNINDENT .SS Variant annotation .INDENT 0.0 .IP \(bu 2 \fBeffects\fP Method used to calculate expected variant effects; defaults to \fI\%snpEff\fP\&. \fI\%Ensembl variant effect predictor (VEP)\fP is also available with support for \fI\%dbNSFP\fP and \fI\%dbscSNV\fP annotation, when downloaded using datatarget\-install\&. [snpeff, vep, false] .IP \(bu 2 \fBeffects_transcripts\fP Define the transcripts to use for effect prediction annotation. Options \fBall\fP: Standard Ensembl transcript list (the default); \fBcanonical\fP: Report single canonical transcripts (\fB\-canon\fP in snpEff, \fB\-pick\fP in VEP); \fBcanonical_cancer\fP Canonical transcripts with hand curated changes for more common cancer transcripts (snpEff only). .IP \(bu 2 \fBvcfanno\fP Configuration files for \fI\%vcfanno\fP, allowing the application of additional annotations to variant calls. By default, bcbio will try to apply: .INDENT 2.0 .INDENT 3.5 .INDENT 0.0 .IP \(bu 2 \fBgemini\fP \-\- External population level annotations from \fI\%GEMINI\fP\&. This is only run for human samples with gemini data installed (datatarget\-install). .IP \(bu 2 \fBsomatic\fP \-\- Somatic annotations from COSMIC, ClinVar and friends. COSMIC needs a custom installation within bcbio (datatarget\-install). Only added for tumor or tumor/normal somatic calling. .IP \(bu 2 \fBrnaedit\fP \-\- RNA editing sites for RNA\-seq variant calling runs. .UNINDENT .UNINDENT .UNINDENT .sp bcbio installs pre\-prepared configuration files in \fBgenomes/build/config/vcfanno\fP or you can specify the full path to a \fB/path/your/anns.conf\fP and optionally an equivalently named \fB/path/your/anns.lua\fP file. This value can be a list for multiple inputs. .UNINDENT .SS Structural variant calling .INDENT 0.0 .IP \(bu 2 \fBsvcaller\fP \-\- List of structural variant callers to use. [lumpy, manta, cnvkit, seq2c, delly, battenberg]. LUMPY and Manta require paired\-end reads. .IP \(bu 2 \fBsvprioritize\fP \-\- Produce a tab separated summary file of structural variants in regions of interest. This complements the full VCF files of structural variant calls to highlight changes in known genes. See the \fI\%paper on cancer genome prioritization\fP for the full details.
This can be either the path to a BED file (with \fBchrom start end gene_name\fP, see \fI\%Input file preparation\fP) or the name of one of the pre\-installed prioritization files: .INDENT 2.0 .INDENT 3.5 .INDENT 0.0 .IP \(bu 2 \fBcancer/civic\fP (hg19, GRCh37, hg38) \-\- Known cancer associated genes from \fI\%CIViC\fP\&. .IP \(bu 2 \fBcancer/az300\fP (hg19, GRCh37, hg38) \-\- 300 cancer associated genes contributed by \fI\%AstraZeneca oncology\fP\&. .IP \(bu 2 \fBcancer/az\-cancer\-panel\fP (hg19, GRCh37, hg38) \-\- A text file of genes in the AstraZeneca cancer panel. This is only usable for \fBsvprioritize\fP which can take a list of gene names instead of a BED file. .IP \(bu 2 \fBactionable/ACMG56\fP \-\- Medically actionable genes from \fI\%The American College of Medical Genetics and Genomics\fP\&. .IP \(bu 2 \fBcoding/ccds\fP (hg38) \-\- \fI\%Consensus CDS (CCDS)\fP regions with 2bp added to internal introns to capture canonical splice acceptor/donor sites, and multiple transcripts from a single gene merged into a single all inclusive gene entry. .UNINDENT .UNINDENT .UNINDENT .IP \(bu 2 \fBfusion_mode\fP Enable fusion detection in RNA\-seq when using STAR (recommended) or Tophat (not recommended) as the aligner. OncoFuse is used to summarise the fusions but currently only supports \fBhg19\fP and \fBGRCh37\fP\&. For explant samples \fBdisambiguate\fP enables disambiguation of \fBSTAR\fP output [false, true]. This option is deprecated in favor of \fBfusion_caller\fP\&. .IP \(bu 2 \fBfusion_caller\fP Specify a standalone fusion caller for fusion mode. Supports \fBoncofuse\fP for STAR/tophat runs, \fBpizzly\fP and \fBericscript\fP for all runs. If a standalone caller is specified (e.g. \fBpizzly\fP or \fBericscript\fP), fusion detection will not be performed by the aligner. \fBoncofuse\fP only supports human genome builds GRCh37 and hg19. \fBericscript\fP supports human genome builds GRCh37, hg19 and hg38 after installing the associated fusion databases (datatarget\-install). .UNINDENT .SS HLA typing .INDENT 0.0 .IP \(bu 2 \fBhlacaller\fP \-\- Perform identification of highly polymorphic HLAs with human build 38 (hg38). The recommended option is \fBoptitype\fP, using the \fI\%OptiType\fP caller. Also supports using the \fI\%bwa HLA typing implementation\fP with \fBbwakit\fP\&. .UNINDENT .SS Validation .sp bcbio pre\-installs standard truth sets for performing validation, and also allows use of custom local files for assessing reliability of your runs: .INDENT 0.0 .IP \(bu 2 \fBvalidate\fP A VCF file of expected variant calls to perform validation and grading of small variants (SNPs and indels) from the pipeline. This provides a mechanism to ensure consistency of calls against a known set of variants, supporting comparisons to genotyping array data or reference materials. .IP \(bu 2 \fBvalidate_regions\fP A BED file of regions to evaluate small variant calls in. This defines specific regions covered by the \fBvalidate\fP VCF file. .IP \(bu 2 \fBsvvalidate\fP \-\- Dictionary of call types and pointer to BED file of known regions. For example: \fBDEL: known_deletions.bed\fP does deletion based validation of outputs against the BED file. .UNINDENT .sp Each option can be either the path to a local file, or a partial path to a file in the pre\-installed truth sets.
For instance, to validate an NA12878 run against the \fI\%Genome in a Bottle\fP truth set: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C validate: giab\-NA12878/truth_small_variants.vcf.gz validate_regions: giab\-NA12878/truth_regions.bed svvalidate: DEL: giab\-NA12878/truth_DEL.bed .ft P .fi .UNINDENT .UNINDENT .sp The pre\-installed truth sets follow the same naming schemes for small variants, regions and different structural variant types. bcbio has the following validation materials for germline validations: .INDENT 0.0 .IP \(bu 2 \fBgiab\-NA12878\fP \-\- \fI\%Genome in a Bottle\fP for NA12878, a Caucasian sample. Truth sets: small_variants, regions, DEL; Builds: GRCh37, hg19, hg38 .IP \(bu 2 \fBgiab\-NA24385\fP \-\- \fI\%Genome in a Bottle\fP for NA24385, an Ashkenazic Jewish sample. Truth sets: small_variants, regions; Builds: GRCh37, hg19, hg38 .IP \(bu 2 \fBgiab\-NA24631\fP \-\- \fI\%Genome in a Bottle\fP for NA24631, a Chinese sample. Truth sets: small_variants, regions; Builds: GRCh37, hg19, hg38 .IP \(bu 2 \fBgiab\-NA12878\-crossmap\fP \-\- \fI\%Genome in a Bottle\fP for NA12878 converted to hg38 with CrossMap. Truth sets: small_variants, regions, DEL; Builds: hg38 .IP \(bu 2 \fBgiab\-NA12878\-remap\fP \-\- \fI\%Genome in a Bottle\fP for NA12878 converted to hg38 with Remap. Truth sets: small_variants, regions, DEL; Builds: hg38 .IP \(bu 2 \fBplatinum\-genome\-NA12878\fP \-\- \fI\%Illumina Platinum Genome\fP for NA12878. Truth sets: small_variants, regions; Builds: hg19, hg38 .UNINDENT .sp and for cancer validations: .INDENT 0.0 .IP \(bu 2 \fBgiab\-NA12878\-NA24385\-somatic\fP \-\- A \fI\%sequenced NA12878/NA24385 mixture\fP providing a somatic\-like truth set for detecting low frequency events. Truth sets: small_variants, regions. Builds: GRCh37, hg38 .IP \(bu 2 \fBdream\-syn3\fP \-\- Synthetic dataset 3 from the \fI\%ICGC\-TCGA DREAM mutation calling challenge\fP\&. Truth sets: small_variants, regions, DEL, DUP, INV, INS. Builds: GRCh37. .IP \(bu 2 \fBdream\-syn4\fP \-\- Synthetic dataset 4 from the \fI\%ICGC\-TCGA DREAM mutation calling challenge\fP\&. Truth sets: small_variants, regions, DEL, DUP, INV. Builds: GRCh37. .IP \(bu 2 \fBdream\-syn3\-crossmap\fP \-\- Synthetic dataset 3 from the \fI\%ICGC\-TCGA DREAM mutation calling challenge\fP converted to human build 38 coordinates with CrossMap. Truth sets: small_variants, regions, DEL, DUP, INV, INS. Builds: hg38. .IP \(bu 2 \fBdream\-syn4\-crossmap\fP \-\- Synthetic dataset 4 from the \fI\%ICGC\-TCGA DREAM mutation calling challenge\fP converted to human build 38 coordinates with CrossMap. Truth sets: small_variants, regions, DEL, DUP, INV. Builds: hg38. .UNINDENT .sp For more information on the hg38 truth set preparation see the work on \fI\%validation on build 38 and conversion of human build 37 truth sets to build 38\fP\&. The \fI\%installation recipes\fP contain provenance details about the origins of the installed files. .SS UMIs .sp Unique molecular identifiers (UMIs) are short random barcodes used to tag individual molecules and avoid amplification biases. Both single cell RNA\-seq and variant calling support UMIs. For variant calling, \fI\%fgbio\fP collapses sequencing duplicates for each UMI into a single consensus read prior to running re\-alignment and variant calling. This requires \fBmark_duplicates: true\fP (the default) since it uses position based duplicates and UMI tags for collapsing duplicate reads into consensus sequences. .sp To help with preparing fastq files with UMIs, bcbio provides the script \fBbcbio_fastq_umi_prep.py\fP\&.
This handles two kinds of UMI barcodes: .INDENT 0.0 .IP \(bu 2 Separate UMIs: it converts reads output by an Illumina instrument as 3 files (read 1, read 2, and UMIs). .IP \(bu 2 Duplex barcodes with tags incorporated at the 5\(aq end of read 1 and read 2. .UNINDENT .sp In both cases, these get converted into paired reads with UMIs in the fastq names, allowing specification of \fBumi_type: fastq_name\fP in your bcbio YAML configuration. The script runs on a single set of files or autopairs an entire directory of fastq files. To convert a directory with separate UMI files: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_fastq_umi_prep.py autopair \-c .ft P .fi .UNINDENT .UNINDENT .sp To convert duplex barcodes present on the ends of read 1 and read 2: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C bcbio_fastq_umi_prep.py autopair \-c \-\-tag1 5 \-\-tag2 5 .ft P .fi .UNINDENT .UNINDENT .sp Configuration options for UMIs: .INDENT 0.0 .IP \(bu 2 \fBumi_type\fP The UMI/cellular barcode scheme used for your data. For single cell RNA sequencing, supports [harvard\-indrop, harvard\-indrop\-v2, 10x_v2, icell8, surecell]. For variant analysis with UMI based consensus calling, supports either \fBfastq_name\fP with UMIs in read names or the path to a fastq file with UMIs for each aligned read. .UNINDENT .sp You can change the \fI\%fgbio default options\fP by adjusting \fI\%Resources\fP\&. The most common change is modifying the minimum number of reads as input to consensus sequences. This defaults to 1 to avoid losing reads, but you can set it to larger values for high depth panels: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C resources: fgbio: options: [\-\-min\-reads, 2] .ft P .fi .UNINDENT .UNINDENT .SS RNA sequencing .INDENT 0.0 .IP \(bu 2 \fBtranscript_assembler\fP If set, will assemble novel genes and transcripts and merge the results into the known annotation. Can have multiple values set in a list. Supports [\(aqcufflinks\(aq, \(aqstringtie\(aq]. .IP \(bu 2 \fBtranscriptome_align\fP If set to True, will also align reads to just the transcriptome, for use with EBSeq and others. .IP \(bu 2 \fBexpression_caller\fP A list of optional expression callers to turn on. Supports [\(aqcufflinks\(aq, \(aqexpress\(aq, \(aqstringtie\(aq, \(aqsailfish\(aq, \(aqdexseq\(aq, \(aqkallisto\(aq]. Salmon and count based expression estimation are run by default. .IP \(bu 2 \fBfusion_caller\fP A list of optional fusion callers to turn on. Supports [oncofuse, pizzly]. .IP \(bu 2 \fBvariantcaller\fP Variant calling algorithm to call variants on RNA\-seq data. Supports [gatk\-haplotype] or [vardict]. .IP \(bu 2 \fBspikein_fasta\fP A FASTA file of spike in sequences to quantitate. .IP \(bu 2 \fBbcbiornaseq\fP A dictionary of key\-value pairs to be passed as options to bcbioRNAseq. Currently supports \fIorganism\fP as a key and takes the latin name of the genome used (\fImus musculus\fP, \fIhomo sapiens\fP, etc) and \fIinteresting_groups\fP which will be used to color quality control plots: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C bcbiornaseq: organism: homo sapiens interesting_groups: [treatment, genotype, etc, etc] .ft P .fi .UNINDENT .UNINDENT .UNINDENT .sp You will also need to turn on \fBbcbiornaseq\fP via \fBtools_on: [bcbiornaseq]\fP\&. .SS Fast RNA\-seq .INDENT 0.0 .IP \(bu 2 \fBtranscriptome_fasta\fP An optional FASTA file of transcriptome sequences to quantitate rather than using bcbio installed transcriptome sequences.
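.sp
A minimal sketch of pointing this option at a custom FASTA file (the path is a placeholder):
.INDENT 2.0
.INDENT 3.5
.sp
.nf
.ft C
algorithm:
  transcriptome_fasta: /path/to/custom_transcriptome.fa
.ft P
.fi
.UNINDENT
.UNINDENT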
.SS Single\-cell RNA sequencing .INDENT 0.0 .IP \(bu 2 \fBminimum_barcode_depth\fP Cellular barcodes with fewer than this many reads assigned to them are discarded (default 10,000). .IP \(bu 2 \fBcellular_barcodes\fP A single file or a list of one or two files which have valid cellular barcodes. Provide one file if there is only one barcode and two files if the barcodes are split. If no file is provided, all cellular barcodes passing the \fBminimum_barcode_depth\fP filter are kept. .IP \(bu 2 \fBtranscriptome_fasta\fP An optional FASTA file of transcriptome sequences to quantitate rather than the bcbio installed version. .IP \(bu 2 \fBtranscriptome_gtf\fP An optional GTF file of the transcriptome to quantitate, rather than the bcbio installed version. This is recommended for single\-cell RNA\-sequencing experiments. .IP \(bu 2 \fBsinglecell_quantifier\fP Quantifier to use for single\-cell RNA\-sequencing. Supports \fBrapmap\fP or \fBkallisto\fP\&. .IP \(bu 2 \fBcellular_barcode_correction\fP Number of errors to correct in identified cellular barcodes. Requires a set of known barcodes to be passed with the \fBcellular_barcodes\fP option. Defaults to 1. Set to 0 to turn off error correction. .IP \(bu 2 \fBsample_barcodes\fP A text file with one barcode per line of expected sample barcodes. .UNINDENT .SS smallRNA sequencing .INDENT 0.0 .IP \(bu 2 \fBadapters\fP The 3\(aq end adapter that needs to be removed. For the NextFlex protocol you can add \fBadapters: ["4N", "$3PRIME_ADAPTER"]\fP\&. For any other options you can use resources: \fBatropos:options:["\-u 4", "\-u \-4"]\fP\&. .IP \(bu 2 \fBspecies\fP 3\-letter code to indicate the species in miRBase classification (e.g. hsa for human). .IP \(bu 2 \fBaligner\fP Currently STAR is the only one tested, although bowtie can be used as well. .IP \(bu 2 \fBexpression_caller\fP A list of expression callers to turn on: trna, seqcluster, mirdeep2, mirge (read the smallRNA\-seq documentation to learn how to set up bcbio to run mirge). .IP \(bu 2 \fBtranscriptome_gtf\fP An optional GTF file of the transcriptome to use for seqcluster. .IP \(bu 2 \fBspikein_fasta\fP A FASTA file of spike in sequences to quantitate. .IP \(bu 2 \fBumi_type: \(aqqiagen_smallRNA_umi\(aq\fP Supports the Qiagen UMI small RNA\-seq protocol. .UNINDENT .SS ChIP sequencing .INDENT 0.0 .IP \(bu 2 \fBpeakcaller\fP bcbio only accepts \fB[macs2]\fP .IP \(bu 2 \fBaligner\fP Currently \fBbowtie2\fP is the only one tested .IP \(bu 2 The \fBphenotype\fP and \fBbatch\fP tags need to be set under \fBmetadata\fP in the config YAML file. The \fBphenotype\fP tag will specify the chip (\fBphenotype: chip\fP) and input samples (\fBphenotype: input\fP). The \fBbatch\fP tag will specify the input\-chip pairs of samples, for example \fBbatch: pair1\fP\&. The same input can be used for different chip samples by giving a list of distinct values: \fBbatch: [sample1, sample2]\fP\&. See the sketch after this list. .IP \(bu 2 \fBchip_method\fP: currently supports standard ChIP\-seq (TF or broad regions using \fIchip\fP) or ATAC\-seq (\fIatac\fP). Parameters will change depending on the option to get the best possible results. Only macs2 is supported for now. .UNINDENT
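.sp As a sketch of this metadata layout (sample names, genome build and the \fBanalysis\fP value are illustrative), a chip/input pair might be configured as: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
\- description: sample1_chip
  analysis: chip\-seq
  genome_build: hg38
  algorithm:
    aligner: bowtie2
    peakcaller: [macs2]
  metadata:
    phenotype: chip
    batch: pair1
\- description: sample1_input
  analysis: chip\-seq
  genome_build: hg38
  algorithm:
    aligner: bowtie2
    peakcaller: [macs2]
  metadata:
    phenotype: input
    batch: pair1
.ft P .fi .UNINDENT .UNINDENT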
.sp You can pass different parameters for \fBmacs2\fP by adding to \fI\%Resources\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
resources:
  macs2:
    options: ["\-\-broad"]
.ft P .fi .UNINDENT .UNINDENT .SS Quality control .INDENT 0.0 .IP \(bu 2 \fBmixup_check\fP Detect potential sample mixups. Currently supports \fI\%qSignature\fP\&. \fBqsignature_full\fP runs a larger analysis while \fBqsignature\fP runs a smaller subset on chromosome 22. [False, qsignature, qsignature_full] .IP \(bu 2 \fBkraken\fP Turn on the kraken algorithm to detect possible contamination. You can add \fBkraken: minikraken\fP and it will use a minimal database to detect possible \fI\%contaminants\fP\&. As well, you can point to a \fI\%custom database\fP directory and kraken will use it. You will find the results in the \fIqc\fP directory. You need to use \fI\-\-datatarget kraken\fP during installation to make the minikraken database available. .IP \(bu 2 \fBpreseq\fP Accepts \fBlc_extrap\fP or \fBc_curve\fP, and runs \fI\%Preseq\fP (\fI\%http://smithlabresearch.org/software/preseq\fP), a tool that predicts the yield for future experiments. By default, it runs 300 steps of estimation using a segment length of 100,000. The default extrapolation limit for \fBlc_extrap\fP is 3x the number of reads. You can override the parameters \fBseg_len\fP, \fBsteps\fP and \fBextrap_fraction\fP using the \fI\%Resources\fP section: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
resources:
  preseq:
    extrap_fraction: 5
    steps: 500
    seg_len: 5000
.ft P .fi .UNINDENT .UNINDENT .sp And you can also set \fBextrap\fP and \fBstep\fP parameters directly, as well as provide any other command line option via \fBoptions\fP: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
resources:
  preseq:
    extrap: 10000000
    step: 30000
    options: ["\-D"]
.ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 bcbio uses \fI\%MultiQC\fP to combine QC output for all samples into a single report file. If you need to tweak configuration settings from bcbio defaults, you can use \fI\%Resources\fP\&. For instance, to display read counts with full numbers instead of the default millions: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
resources:
  multiqc:
    options: ["\-\-cl_config", "\(aqread_count_multiplier: 1\(aq"]
.ft P .fi .UNINDENT .UNINDENT .sp or as thousands: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
resources:
  multiqc:
    options: ["\-\-cl_config", "\(aq{read_count_multiplier: 0.001, read_count_prefix: K}\(aq"]
.ft P .fi .UNINDENT .UNINDENT .UNINDENT .SS Post\-processing .INDENT 0.0 .IP \(bu 2 \fBarchive\fP Specify targets for long term archival. \fBcram\fP removes fastq names and does 8\-bin compression of BAM files into \fI\%CRAM format\fP\&. \fBcram\-lossless\fP generates CRAM files without changes to quality scores or fastq names. Default: [] \-\- no archiving. .UNINDENT .SS Changing bcbio defaults .sp bcbio provides some hints to change default behavior by either turning specific defaults on or off, with \fBtools_on\fP and \fBtools_off\fP\&. Both can be lists with multiple options: .INDENT 0.0 .IP \(bu 2 \fBtools_off\fP Specify third party tools to skip as part of analysis pipeline. Enables turning off specific components of pipelines if not needed: .INDENT 2.0 .IP \(bu 2 \fBgatk4\fP Use older GATK versions (3.x) for GATK commands like BQSR, HaplotypeCaller and VQSR. By default bcbio includes GATK4 and uses it. .IP \(bu 2 \fBvqsr\fP turns off variant quality score recalibration for all samples. .IP \(bu 2 \fBbwa\-mem\fP forces use of original \fBbwa aln\fP alignment. Without this, we use bwa mem with 70bp or longer reads. .IP \(bu 2 \fBfastqc\fP turns off quality control FastQC usage. .IP \(bu 2 \fBlumpy\-genotype\fP skip genotyping for Lumpy samples, which can be slow in the case of many structural variants. .IP \(bu 2 \fBseqcluster\fP turns off use of the seqcluster tool in the srnaseq pipeline.
.IP \(bu 2 \fBtumoronly\-prioritization\fP turns off attempted removal of germline variants from tumor only calls using external population data sources like ExAC and 1000 genomes. .IP \(bu 2 \fBvardict_somatic_filter\fP disables running a post calling filter for VarDict to remove variants found in normal samples. Without \fBvardict_somatic_filter\fP in paired analyses, no soft filtering of germline variants is performed, but all high quality variants pass. .IP \(bu 2 \fBupload_alignment\fP turns off final upload of large alignment files. .IP \(bu 2 \fBpbgzip\fP turns off use of bgzip with multiple threads. .IP \(bu 2 \fBcoverage_qc\fP turns off calculation of coverage statistics with samtools\-stats and picard. .UNINDENT .IP \(bu 2 \fBtools_on\fP Specify functionality to enable that is off by default: .INDENT 2.0 .IP \(bu 2 \fBqualimap\fP runs \fI\%Qualimap\fP (Qualimap uses downsampled files, and numbers here are an estimate based on 1e7 reads). .IP \(bu 2 \fBqualimap_full\fP runs Qualimap with full BAM files, but it may be slow. .IP \(bu 2 \fBdamage_filter\fP annotates low frequency somatic calls in INFO/DKFZBias for DNA damage artifacts using \fI\%DKFZBiasFilter\fP\&. .IP \(bu 2 \fBtumoronly_germline_filter\fP applies a \fBLowPriority\fP filter to tumor\-only calls that match population germline databases. The default is to just apply a tag \fBEPR\fP (external prioritization) that flags variants present in external databases. Anything missing a \fBpass\fP here is likely a germline variant. .IP \(bu 2 \fBvqsr\fP makes GATK try quality score recalibration for variant filtration, even for smaller sample sizes. .IP \(bu 2 \fBsvplots\fP adds additional coverage and summary plots for CNVkit and detected ensemble variants. .IP \(bu 2 \fBbwa\-mem\fP forces use of bwa mem even for samples with less than 70bp reads. .IP \(bu 2 \fBgvcf\fP forces gVCF output for callers that support it (GATK HaplotypeCaller, FreeBayes, Platypus). .IP \(bu 2 \fBvep_splicesite_annotations\fP enables the use of the MaxEntScan and SpliceRegion plugins for VEP. Both optional plugins add extra splice site annotations. .IP \(bu 2 \fBgemini\fP Create a \fI\%GEMINI database\fP of variants for downstream query using the new vcfanno and vcf2db approach. .IP \(bu 2 \fBgemini_orig\fP Create a \fI\%GEMINI database\fP of variants using the older GEMINI loader. Only works for GRCh37 and hg19. .IP \(bu 2 \fBgemini_allvariants\fP enables all variants to go into GEMINI, not only those that pass filters. .IP \(bu 2 \fBvcf2db_expand\fP decompresses and expands the genotype columns in the vcfanno prepared GEMINI databases, enabling standard SQL queries on genotypes and depths. .IP \(bu 2 \fBbnd\-genotype\fP enables genotyping of breakends in Lumpy calls, which improves accuracy but can be slow. .IP \(bu 2 \fBlumpy_usecnv\fP uses input calls from CNVkit as prior evidence to Lumpy calling. .IP \(bu 2 \fBcoverage_perbase\fP calculates per\-base coverage depth for analyzed variant regions. .IP \(bu 2 \fBbcbiornaseq\fP loads a bcbioRNASeq object for use with \fI\%bcbioRNASeq\fP\&. .UNINDENT .UNINDENT .SS Parallelization .INDENT 0.0 .IP \(bu 2 \fBnomap_split_size\fP Unmapped base pair regions required to split analysis into blocks. Creates islands of mapped reads surrounded by unmapped (or N) regions, allowing each mapped region to run in parallel. (default: 250) .IP \(bu 2 \fBnomap_split_targets\fP Number of target intervals to attempt to split processing into. This picks unmapped regions evenly spaced across the genome to process concurrently.
Limiting targets prevents a large number of small targets. (default: 200 for standard runs, 20 for CWL runs) .UNINDENT .SS Ensemble variant calling .sp In addition to single method variant calling, we support calling with multiple calling methods and consolidating into a final Ensemble callset. .sp The recommended method to do this uses a simple majority rule ensemble classifier that builds a final callset based on the intersection of calls. It selects variants represented in at least a specified number of callers: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
variantcaller: [mutect2, varscan, freebayes, vardict]
ensemble:
  numpass: 2
  use_filtered: false
.ft P .fi .UNINDENT .UNINDENT .sp This example selects variants present in 2 out of the 4 callers and does not use filtered calls (the default behavior). Because of the difficulties of producing a unified FORMAT/genotype field across callers, the ensemble output contains a mix of outputs from the different callers. It picks a representative call in the order of the specified callers, so the example above would have a MuTect2 call if present, otherwise a VarScan call if present, otherwise a FreeBayes call. This may require custom normalization scripts during post\-processing when using these calls. \fI\%bcbio.variation.recall\fP implements this approach, which handles speed and file sorting limitations in the \fI\%bcbio.variation\fP approach. .sp The older approach uses the \fI\%bcbio.variation\fP toolkit to perform the consolidation. An example configuration in the \fBalgorithm\fP section is: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
variantcaller: [gatk, freebayes, samtools, gatk\-haplotype, varscan]
ensemble:
  format\-filters: [DP < 4]
  classifier\-params:
    type: svm
  classifiers:
    balance: [AD, FS, Entropy]
    calling: [ReadPosEndDist, PL, PLratio, Entropy, NBQ]
  trusted\-pct: 0.65
.ft P .fi .UNINDENT .UNINDENT .sp The \fBensemble\fP set of parameters configures how to combine calls from the multiple methods: .INDENT 0.0 .IP \(bu 2 \fBformat\-filters\fP A set of filters to apply to variants before combining. The example removes all calls with a depth of less than 4. .IP \(bu 2 \fBclassifier\-params\fP Parameters to configure the machine learning approaches used to consolidate calls. The example defines an SVM classifier. .IP \(bu 2 \fBclassifiers\fP Groups of classifiers to use for training and evaluating during machine learning. The example defines two sets of criteria for distinguishing reads with allele balance issues and those with low calling support. .IP \(bu 2 \fBtrusted\-pct\fP Defines the threshold of variants to include in the final callset. In the example, variants called by more than 65% of the approaches (4 or more callers) pass without requiring SVM filtering. .UNINDENT .SH RESOURCES .sp The \fBresources\fP section allows customization of locations of programs and memory and compute resources to devote to them: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
resources:
  bwa:
    cores: 12
    cmd: /an/alternative/path/to/bwa
  samtools:
    cores: 16
    memory: 2G
  gatk:
    jvm_opts: ["\-Xms2g", "\-Xmx4g"]
.ft P .fi .UNINDENT .UNINDENT .INDENT 0.0 .IP \(bu 2 \fBcmd\fP Location of an executable. By default, we assume executables are on the path. .IP \(bu 2 \fBcores\fP Cores to use for multi\-processor enabled software. This is how many cores will be allocated per job. For example, if you are running 10 samples and passed \-n 40 to bcbio\-nextgen and the step you are running has cores: 8 set, a maximum of five samples will run in parallel, each using 8 cores.
.IP \(bu 2 \fBjvm_opts\fP Specific memory usage options for Java software. For memory usage on programs like GATK, specify the maximum usage per core. On multicore machines, that\(aqs machine\-memory divided by cores. This avoids memory errors when running multiple jobs simultaneously, while the framework will adjust memory up when running multicore jobs. .IP \(bu 2 \fBmemory\fP Specify the memory per core used by a process. For programs where memory control is available, like \fBsamtools sort\fP, this limits memory usage. For other programs this is an estimate of usage, used by memory\-management to avoid over\-scheduling memory. Always specify this as the memory usage for a single core, and the pipeline handles scaling this when a process uses multiple cores. .IP \(bu 2 \fBkeyfile\fP Specify the location of a program specific key file or license server, obtained from a third party software tool. Supports licenses for \fI\%novoalign\fP and \fI\%Sentieon\fP\&. For more complex Sentieon setups this can also be a dictionary of environmental variables: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
resources:
  sentieon:
    keyfile:
      SENTIEON_LICENSE_SERVER: 100.100.100.100:8888
      SENTIEON_AUTH_MECH: XXX
      SENTIEON_AUTH_DATA: signature
.ft P .fi .UNINDENT .UNINDENT .UNINDENT .SS Temporary directory .sp You can also use the resources section to specify system specific parameters like global temporary directories: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
resources:
  tmp:
    dir: /scratch
.ft P .fi .UNINDENT .UNINDENT .sp This is useful on cluster systems with large attached local storage, where you can avoid some shared filesystem IO by writing temporary files to the local disk. When setting this, keep in mind that the global temporary disk must have enough space to handle intermediates. The space differs between steps, but generally you\(aqd need to have 2 times the largest input file per sample and account for samples running simultaneously on multiple core machines. .sp To handle clusters that specify local scratch space with an environmental variable, bcbio will resolve environmental variables like: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
resources:
  tmp:
    dir: $YOUR_SCRATCH_LOCATION
.ft P .fi .UNINDENT .UNINDENT .SS Sample or run specific resources .sp To override any of the global resource settings in a sample specific manner, you write a resource section within your sample YAML configuration. For example, to create a sample specific temporary directory and pass a command line option to novoalign, write a sample resource specification like: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
\- description: Example
  analysis: variant2
  resources:
    novoalign:
      options: ["\-o", "FullNW", "\-\-rOQ"]
    tmp:
      dir: tmp/sampletmpdir
.ft P .fi .UNINDENT .UNINDENT .sp To adjust resources for an entire run, you can add this resources specification at the top level of your sample YAML: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
details:
  \- description: Example
resources:
  default:
    cores: 16
.ft P .fi .UNINDENT .UNINDENT .SS Logging directory .sp By default, bcbio writes the logging\-output directory to \fBlog\fP in the main directory of the run. You can configure this to a different location in your \fBbcbio\-system.yaml\fP with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
log_dir: /path/to/logs
.ft P .fi .UNINDENT .UNINDENT .SH INPUT FILE PREPARATION .sp Input files for supplementing analysis, like \fBvariant_regions\fP, need to match the specified reference genome.
A common cause of confusion is the two chromosome naming schemes for human genome build 37: UCSC\-style in hg19 (chr1, chr2) and Ensembl/NCBI style in GRCh37 (1, 2). To help avoid some of this confusion, in build 38 we only support the commonly agreed\-upon chr1, chr2 style. .sp It\(aqs important to ensure that the chromosome naming in your input files matches that of the selected reference genome. bcbio will try to detect this and provide helpful errors if you miss it. .sp To convert chromosome names, you can use \fI\%Devon Ryan\(aqs collection of chromosome mappings\fP as an input to sed. For instance, to convert hg19 chr\-style coordinates to GRCh37: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
wget \-\-no\-check\-certificate \-qO\- http://raw.githubusercontent.com/dpryan79/ChromosomeMappings/master/GRCh37_UCSC2ensembl.txt \e
   | awk \(aq{if($1!=$2) print "s/^"$1"/"$2"/g"}\(aq > remap.sed
sed \-f remap.sed original.bed > final.bed
.ft P .fi .UNINDENT .UNINDENT .SH GENOME CONFIGURATION FILES .sp Each genome build has an associated \fBbuildname\-resources.yaml\fP configuration file which contains organism specific naming and resource files. bcbio\-nextgen expects a resource file present next to the genome FASTA file. \fI\%Example genome configuration files\fP are available, and automatically installed for natively supported genomes. Create these by hand to support additional organisms or builds. .sp The major sections of the file are: .INDENT 0.0 .IP \(bu 2 \fBaliases\fP \-\- Names for third\-party programs used as part of the analysis, since naming expectations can differ between software programs. .IP \(bu 2 \fBvariation\fP \-\- Supporting data files for variant analysis. For human analyses, the dbSNP and training files are from the \fI\%GATK resource bundle\fP\&. These are inputs into the training models for recalibration. The automated \fI\%CloudBioLinux\fP data scripts will download and install these in the variation subdirectory relative to the genome files. .IP \(bu 2 \fBrnaseq\fP \-\- Supporting data files for RNA\-seq analysis. The automated installer and updater handle retrieval and installation of these resources for supported genome builds. .IP \(bu 2 \fBsrnaseq\fP \-\- Supporting data files for smallRNA\-seq analysis. Same as in rnaseq, the automated installer and updater handle this for supported genome builds. .UNINDENT .sp By default, we place the \fBbuildname\-resources.yaml\fP files next to the genome FASTA files in the reference directory. For custom setups, you specify an alternative directory in the config\-resources section of your \fBbcbio_system.yaml\fP file: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
resources:
  genome:
    dir: /path/to/resources/files
.ft P .fi .UNINDENT .UNINDENT .SH REFERENCE GENOME FILES .sp The pipeline requires access to reference genomes, including the raw FASTA sequence and pre\-built indexes for aligners. The automated installer will install reference files and indexes for commonly used genomes (see the upgrade\-install documentation for command line options). .sp For human genomes, we recommend using build 38 (hg38). This is \fI\%fully supported and validated\fP in bcbio, and corrects a lot of issues in the previous build 37. We use the \fI\%1000 genomes distribution\fP which includes HLAs and decoy sequences. For human build 37, GRCh37 and hg19, we use the 1000 genome references provided in the \fI\%GATK resource bundle\fP\&. These differ in chromosome naming: hg19 uses chr1, chr2, chr3 style contigs while GRCh37 uses 1, 2, 3.
They also differ \fI\%slightly in content\fP: GRCh37 has masked \fI\%Pseudoautosomal regions\fP on chromosome Y allowing alignment to these regions on chromosome X. .sp You can use pre\-existing data and reference indexes by pointing bcbio\-nextgen at these resources. We use the \fI\%Galaxy .loc files\fP approach to describing the location of the sequence and index data, as described in the data requirements documentation. This does not require a Galaxy installation since the installer sets up a minimal set of \fB\&.loc\fP files. It finds these by identifying the root \fBgalaxy\fP directory, in which it expects a \fBtool\-data\fP sub\-directory with the \fB\&.loc\fP files. It can do this in two ways: .INDENT 0.0 .IP \(bu 2 Using the directory of your \fBbcbio\-system.yaml\fP\&. This is the default mechanism setup by the automated installer and requires no additional work. .IP \(bu 2 From the path specified by the \fBgalaxy_config\fP option in your \fBbcbio\-system.yaml\fP\&. If you\(aqd like to move your system YAML file, add the full path to your \fBgalaxy\fP directory here. This is useful if you have a pre\-existing Galaxy installation with reference data. .UNINDENT .sp To manually make genomes available to bcbio\-nextgen, edit the individual \fB\&.loc\fP files with locations to your reference and index genomes. You need to edit \fBsam_fa_indices.loc\fP to point at the FASTA files and then any genome indexes corresponding to aligners you\(aqd like to use (for example: \fBbwa_index.loc\fP for bwa and \fBbowtie2_indices.loc\fP for bowtie2). The database key names used (like \fBGRCh37\fP and \fBmm10\fP) should match those used in the \fBgenome_build\fP of your sample input configuration file.
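.sp As an illustrative sketch (the path is a placeholder, and the exact column layout can differ between \fB\&.loc\fP files), a tab\-separated entry in \fBsam_fa_indices.loc\fP might look like: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
index   GRCh37  /path/to/genomes/Hsapiens/GRCh37/seq/GRCh37.fa
.ft P .fi .UNINDENT .UNINDENT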
.SH ADDING CUSTOM GENOMES .sp \fBbcbio_setup_genome.py\fP will help you to install a custom genome and apply all changes needed to the configuration files. It needs the genome in FASTA format, and the annotation file in GTF or GFF3 format. It can create indexes for all aligners used by bcbio. Moreover, it will create the folder \fIrnaseq\fP to allow you to run the RNA\-seq pipeline without further configuration. .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_setup_genome.py \-f genome.fa \-g annotation.gtf \-i bowtie2 star seq \-n Celegans \-b WBcel235
.ft P .fi .UNINDENT .UNINDENT .sp If you want to add smallRNA\-seq data files, you will need to add the 3\-letter miRBase code for your genome (e.g. hsa for human) and the GTF file for the annotation of smallRNA data. Here you can use the same file as the transcriptome annotation if no other is available. .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_setup_genome.py \-f genome.fa \-g annotation.gtf \-i bowtie2 star seq \-n Celegans \-b WBcel235 \-\-species cel \-\-srna_gtf another_annotation.gtf
.ft P .fi .UNINDENT .UNINDENT .sp To use that genome, you just need to configure your YAML files as: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
genome_build: WBcel235
.ft P .fi .UNINDENT .UNINDENT .SS Effects prediction .sp To perform variant calling and predict effects in a custom genome, you\(aqd have to manually download and link this into your installation. First find the snpEff genome build: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
$ snpEff databases | grep Lactobacillus | grep pentosus
Lactobacillus_pentosus_dsm_20314   Lactobacillus_pentosus_dsm_20314   ENSEMBL_BFMPP_32_179   http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
Lactobacillus_pentosus_kca1        Lactobacillus_pentosus_kca1        ENSEMBL_BFMPP_32_179   http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
.ft P .fi .UNINDENT .UNINDENT .sp then download to the appropriate location: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
$ cd /path/to/bcbio/genomes/Lacto/Lactobacillus_pentosus
$ mkdir snpEff
$ cd snpEff
$ wget http://downloads.sourceforge.net/project/snpeff/databases/v4_3/snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
$ unzip snpEff_v4_3_ENSEMBL_BFMPP_32_179.zip
$ find . \-name "Lactobacillus_pentosus_dsm_20314"
\&./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314
$ mv ./home/pcingola/snpEff/data/Lactobacillus_pentosus_dsm_20314 .
.ft P .fi .UNINDENT .UNINDENT .sp finally add to your genome configuration file (\fBseq/Lactobacillus_pentosus\-resources.yaml\fP): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
aliases:
  snpeff: Lactobacillus_pentosus_dsm_20314
.ft P .fi .UNINDENT .UNINDENT .sp For adding an organism not present in snpEff, please see this \fI\%mailing list discussion\fP\&. .sp The pipeline runs in parallel in two different ways: .INDENT 0.0 .IP \(bu 2 multiple cores \-\- Analyses will run in parallel using multiple cores on a single machine. This requires only the \fBmultiprocessing\fP Python library, included by default with most Python installations. .IP \(bu 2 parallel messaging \-\- This allows scaling beyond the cores available on a single machine, and requires multiple machines with a shared filesystem like standard cluster environments. Machine to machine communication occurs via messaging, using the \fI\%IPython parallel\fP framework. .UNINDENT .SH TUNING CORE AND MEMORY USAGE .sp bcbio has two ways to specify core usage, helping provide options for parallelizing different types of processes: .INDENT 0.0 .IP \(bu 2 Total available cores: specified with \fB\-n\fP on the commandline, this tells bcbio how many total cores to use. This applies either to a local multicore run or a distributed job. .IP \(bu 2 Maximum cores to use for multicore processing of individual jobs. You specify this in the \fBresources\fP section of either a sample YAML file (sample\-resources) or \fBbcbio_system.yaml\fP\&. Ideally you specify this in the \fBdefault\fP section (along with memory usage). For example, this would specify that processes using multiple cores can get up to 16 cores with 2G of memory per core: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
resources:
  default:
    memory: 2G
    cores: 16
    jvm_opts: ["\-Xms750m", "\-Xmx2000m"]
.ft P .fi .UNINDENT .UNINDENT .UNINDENT .sp bcbio uses these settings, along with memory requests, to determine how to partition jobs. For example, if you had \fB\-n 32\fP and \fBcores: 16\fP for a run on a single 32 core machine, this would run two simultaneous bwa mapping jobs using 16 cores each. .sp Memory specifications (both in \fBmemory\fP and \fBjvm_opts\fP) are per\-core. bcbio takes care of adjusting this memory to match the cores used. In the example above, if bcbio was running a 16 core java process, it would use 32Gb of memory for the JVM, adjusting \fBXmx\fP and \fBXms\fP to match cores used.
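.sp As an illustrative sketch of that adjustment (the exact java command bcbio constructs may differ), the per\-core settings above scale for a 16 core job roughly as: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
# per\-core specification from the resources section:
#   memory: 2G, jvm_opts: ["\-Xms750m", "\-Xmx2000m"]
# Xms and Xmx scale with the cores used:
#   750m x 16 = 12000m; 2000m x 16 = 32000m (~32Gb)
java \-Xms12000m \-Xmx32000m ...
.ft P .fi .UNINDENT .UNINDENT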
.sp Internally, bcbio looks at the memory and CPU usage on a machine and matches your configuration options to the available system resources. It will scale down core requests if memory is limiting, avoiding over\-scheduling resources during the run. You ideally want to set both \fBmemory\fP and \fBjvm_opts\fP to match the average memory per core on the run machine and adjust upwards if this does not provide enough memory for some processes during the run. .sp For single machine runs with a small number of samples, you generally want to set \fBcores\fP close to or equal to the number of total cores you\(aqre allocating to the job with \fB\-n\fP\&. This will allow individual samples to process as fast as possible and take advantage of multicore software. .sp For distributed jobs, you want to set \fBcores\fP to match the available cores on a single node in your cluster, then use \fB\-n\fP as a multiple of this to determine how many nodes to spin up. For example, \fBcores: 16\fP and \fB\-n 64\fP would try to make four 16 core machines available for analysis. .SH MULTIPLE CORES .sp Running using multiple cores only requires setting the \fB\-n\fP command line flag: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_nextgen.py bcbio_sample.yaml \-t local \-n 12
.ft P .fi .UNINDENT .UNINDENT .SH IPYTHON PARALLEL .sp \fI\%IPython parallel\fP provides a distributed framework for performing parallel computation in standard cluster environments. The bcbio\-nextgen setup script installs both IPython and \fI\%pyzmq\fP, which provides Python bindings for the \fI\%ZeroMQ\fP messaging library. The only additional requirement is that the work directory where you run the analysis is accessible to all processing nodes. This is typically accomplished with a distributed file system like \fI\%NFS\fP, \fI\%Gluster\fP or \fI\%Lustre\fP\&. .sp Run an analysis using ipython for parallel execution: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_nextgen.py bcbio_sample.yaml \-t ipython \-n 12 \-s lsf \-q queue
.ft P .fi .UNINDENT .UNINDENT .sp The \fB\-s\fP flag specifies the type of scheduler to use \fB(lsf, sge, torque, slurm, pbspro)\fP\&. .sp The \fB\-q\fP flag specifies the queue to submit jobs to. .sp The \fB\-n\fP flag defines the total number of cores to use on the cluster during processing. The framework will select the appropriate number of cores and type of cluster (single core versus multi\-core) to use based on the pipeline stage (see the internals\-parallel section in the internals documentation for more details). For multiple core steps, the number of cores to use for programs like \fBbwa\fP, \fBnovoalign\fP and \fBgatk\fP comes from the config\-resources section of the configuration. Ensure the \fBcores\fP specification matches the physical cores available on machines in your cluster, and the pipeline will divide the total cores specified by \fB\-n\fP into the appropriate number of multicore jobs to run. .sp The pipeline default parameters assume a system with minimal time to obtain processing cores and consistent file system accessibility. These defaults allow the system to fail fast in the case of cluster issues which need diagnosis. For running on shared systems with high resource usage and potential failures due to intermittent cluster issues, there are tuning parameters that increase resiliency. The \fB\-\-timeout\fP flag specifies the number of minutes to wait for a cluster to start up before timing out. This defaults to 15 minutes. The \fB\-\-retries\fP flag specifies the number of times to retry a job on failure. In systems with transient distributed file system hiccups like lock errors or disk availability, this will provide recoverability at the cost of resubmitting jobs that may have failed for reproducible reasons.
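.sp For example, a more resilient distributed run combining these flags might look like this sketch (the queue name and values are placeholders): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_nextgen.py bcbio_sample.yaml \-t ipython \-n 64 \-s slurm \-q your_queue \e
    \-\-timeout 30 \-\-retries 2
.ft P .fi .UNINDENT .UNINDENT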
.sp Finally, the \fB\-r resources\fP flag specifies resource options to pass along to the underlying queue scheduler. This currently supports SGE\(aqs \fB\-l\fP parameter, Torque\(aqs \fB\-l\fP parameter and LSF and SLURM native flags. This allows specification of resources to the scheduler (see the \fI\%qsub man page\fP). You may specify multiple resources, so \fB\-r mem=4g \-r ct=01:40:00\fP translates to \fB\-l mem=4g \-l ct=01:40:00\fP when passed to \fBqsub\fP or \fB\-r "account=a2010002" \-r "timelimit=04:00:00"\fP when using SLURM, for instance. SLURM and Torque support specification of an account parameter with \fB\-r account=your_name\fP, which IPython transfers into \fB\-A\fP\&. .sp SGE supports special parameters passed using resources to help handle the heterogeneity of possible setups. .sp Specify an \fI\%SGE parallel environment\fP that supports using multiple cores on a single node with \fB\-r pename=your_pe\fP\&. Since this setup is system specific, it is hard to write general code for finding a suitable environment. Specifically, when there are multiple usable parallel environments, it will select the first one, which may not be correct. Manually specifying it with a \fBpename=\fP flag to resources will ensure correct selection of the right environment. If you\(aqre administering a grid engine cluster and not sure how to set this up, you\(aqd typically want a \fBsmp\fP queue using \fBallocation_rule: $pe_slots\fP like in this \fI\%example pename configuration\fP or \fI\%smp template\fP\&. .sp SGE has other specific flags you may want to tune, depending on your setup. To specify an advanced reservation with the \fB\-ar\fP flag, use \fB\-r ar=ar_id\fP\&. To specify an alternative memory management model instead of \fBmem_free\fP, use \fB\-r memtype=approach\fP\&. It is further recommended to configure \fBmem_free\fP (or any other chosen memory management model) as a consumable, requestable resource in SGE to prevent overfilling hosts that do not have sufficient memory per slot. This can be done in two steps. First, launch \fBqmon\fP as an admin, select \fBComplex Configuration\fP in qmon, click on \fBmem_free\fP, under the \fBConsumable\fP dialog select \fBJOB\fP (instead of \fBYES\fP or \fBNO\fP) and finally click \fBModify\fP for the changes to take effect. Secondly, for each host in the queue, configure \fBmem_free\fP as a complex value. If a host called \fBmyngshost\fP has 128GB of RAM, the corresponding command would be \fBqconf \-mattr exechost complex_values mem_free=128G myngshost\fP\&. .sp There are also special \fB\-r\fP resources parameters to support pipeline configuration: .INDENT 0.0 .IP \(bu 2 \fB\-r conmem=4\fP \-\- Specify the memory for the controller process, in Gb. This currently applies to SLURM processing and defaults to 4Gb. .IP \(bu 2 \fB\-r minconcores=2\fP \-\- The minimum number of cores to use for the controller process. The controller works on a single core, but this can help in queues where you can only specify multicore jobs. .IP \(bu 2 \fB\-r mincores=16\fP \-\- Specify the minimum number of cores to batch together for parallel single core processes like variant calling.
This will run multiple processes together under a single submission to allow sharing of resources like memory, which is helpful when a small percentage of the time a process like variant calling will use a lot of memory. By default, bcbio will calculate \fBmincores\fP based on specifications for multicore calling, so this doesn\(aqt normally require user adjustment. .UNINDENT .SH TROUBLESHOOTING .SS Diagnosing job failures .sp Parallel jobs can often terminate with rather generic failures like any of the following: .INDENT 0.0 .IP \(bu 2 \fBjoblib/parallel.py", ... TypeError: init() takes at least 3 arguments (2 given)\fP .IP \(bu 2 \fBMultiprocessing exception:\fP .IP \(bu 2 \fBCalledProcessError: Command \(aq\fP .UNINDENT .sp These errors unfortunately don\(aqt help diagnose the problem, and you\(aqll likely see the actual error triggering this generic exception earlier in the run. This error can often be hard to find due to parallelization. .sp If you run into a confusing failure like this, the best approach is to re\-run with a single core: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_nextgen.py your_input.yaml \-n 1
.ft P .fi .UNINDENT .UNINDENT .sp which should produce a more helpful debug message right above the failure. .sp It\(aqs also worth re\-trying the failed command line outside of bcbio to look for errors. You can find the failing command by cross\-referencing the error message with command lines in \fBlog/bcbio\-nextgen\-commands.log\fP\&. You may have to change temporary directories (\fBtx/tmp**\fP) in some of the job outputs. Reproducing the error outside of bcbio is a good first step to diagnosing and fixing the underlying issue. .SS No parallelization where expected .sp This may occur if the current execution is a re\-run of a previous project: .INDENT 0.0 .IP \(bu 2 Files in \fBcheckpoints_parallel/*.done\fP tell bcbio not to parallelize already executed pipeline tasks. This makes restarts faster by avoiding re\-starting a cluster (when using distributed runs) for finished stages. If that behaviour is not desired for a task, removing the checkpoint file will get things parallelizing again. .IP \(bu 2 If the processing of a task is nearly finished, the last jobs of this task will be running and bcbio will wait for those to finish. .UNINDENT .SS IPython parallelization problems .sp Networking problems on clusters can prevent the IPython parallelization framework from working properly. Be sure that the compute nodes on your cluster are aware of IP addresses that they can use to communicate with each other (usually these will be local IP addresses). Running: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
python \-c \(aqimport socket; print socket.gethostbyname(socket.gethostname())\(aq
.ft P .fi .UNINDENT .UNINDENT .sp should return such an IP address (as opposed to localhost). This can be fixed by adding an entry to the hosts file. .sp The line: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
host\-ip hostname
.ft P .fi .UNINDENT .UNINDENT .sp where \fBhost\-ip\fP is replaced by the actual IP address of the machine and \fIhostname\fP by the machine\(aqs own hostname, should be added to \fB/etc/hosts\fP on each compute node. This will probably involve contacting your local cluster administrator. .SH MEMORY MANAGEMENT .sp The memory information specified in the system configuration config\-resources enables scheduling of memory intensive processes.
The values are specified on a \fImemory\-per\-core\fP basis and thus bcbio\-nextgen handles memory scheduling by: .INDENT 0.0 .IP \(bu 2 \fI\%Determining available cores and memory per machine\fP .IP \(bu 2 Calculating the memory and core usage. The system configuration config\-resources contains the expected core and memory usage of external programs. .IP \(bu 2 Adjusting the specified number of total cores to avoid over\-scheduling memory. This allows running programs with more than the available memory per core without getting out of memory system errors. .IP \(bu 2 Passing total memory usage along to schedulers. The SLURM, SGE, Torque and PBSPro schedulers use this information to allocate memory to processes, avoiding issues with other scheduled programs using available memory on a shared machine. .UNINDENT .sp As a result of these calculations, the cores used during processing will not always correspond to the maximum cores provided in the input \fB\-n\fP parameter. The goal is rather to intelligently maximize cores and memory while staying within system resources. Note that memory specifications are for a single core, and the pipeline takes care of adjusting this to actual cores used during processing. .SH DETERMINING AVAILABLE CORES AND MEMORY PER MACHINE .sp bcbio automatically tries to determine the total available memory and cores per machine for balancing resource usage. For multicore runs, it retrieves total memory from the current machine. For parallel runs, it spawns a job on the queue and extracts the system information from that machine. This expects a homogeneous set of machines within a cluster queue. You can see the determined cores and total memory in \fBprovenance/system\-ipython\-queue.yaml\fP\&. .sp For heterogeneous clusters or other cases where bcbio does not correctly identify available system resources, you can manually set the machine cores and total memory in the \fBresources\fP section of either a sample YAML file (sample\-resources) or \fBbcbio_system.yaml\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
resources:
  machine:
    memory: 48.0
    cores: 16
.ft P .fi .UNINDENT .UNINDENT .sp The memory usage is the total available on the machine in Gb, so this specifies that individual machines have 48Gb of total memory and 16 cores. .SH TUNING SYSTEMS FOR SCALE .sp bcbio\-nextgen scales out on clusters including hundreds of cores and is stress tested on systems with 1000 simultaneous processes. Scaling up often requires system specific tuning to handle simultaneous processes. This section collects useful tips and tricks for managing scaling issues. .SS Open file handles .sp A common failure mode is having too many open file handles. This error report can come from the IPython infrastructure logs as ZeroMQ attempts to open sockets, or from the processing logs as third party software gets file handles. You can check your available file handles with \fBulimit \-a | grep open\fP\&. Setting open file handle limits is system and cluster specific and below are tips for specific setups. .sp In addition to open file handle limits (\fBulimit \-n\fP), large processes may also run into issues with available max user processes (\fBulimit \-u\fP). Some systems set a low soft limit (\fBulimit \-Su\fP) like 1024 but a higher hard limit (\fBulimit \-Hu\fP), allowing adjustment without root privileges. The IPython controllers and engines do this automatically, but the main \fBbcbio_nextgen.py\fP driver process cannot. If the scheduler puts this process on the same node as worker processes, you may run into open file handle limits due to work happening on the workers. To fix this, manually set \fBulimit \-u a_high_number\fP as part of the submission process for the main process.
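.sp As an illustrative sketch (scheduler directives and values are placeholders), a SLURM submission script raising the limit before launching the main process might look like: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
#!/bin/bash
#SBATCH \-\-cpus\-per\-task=1
#SBATCH \-\-mem=4G
# raise the soft limit towards the hard limit before starting bcbio
ulimit \-u 16384
bcbio_nextgen.py bcbio_sample.yaml \-t ipython \-n 64 \-s slurm \-q your_queue
.ft P .fi .UNINDENT .UNINDENT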
.sp For an Ubuntu system, edit \fB/etc/security/limits.conf\fP to set the soft and hard \fBnofile\fP descriptors, and edit \fB/etc/pam.d/common\-session\fP to add \fBpam_limits.so\fP\&. See \fI\%this blog post\fP for more details. .sp For CentOS/RedHat systems, edit \fB/etc/security/limits.conf\fP and \fB/etc/security/limits.d/90\-nproc.conf\fP to \fI\%increase maximum open files and user limits\fP\&. .sp SGE needs configuration at the qmaster level. Invoke \fBqconf \-mconf\fP from a host with admin privileges, and edit \fBexecd_params\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
execd_params S_DESCRIPTORS=20000
.ft P .fi .UNINDENT .UNINDENT .SS IO and Network File Systems .sp bcbio\-nextgen makes use of distributed network file systems to manage sharing large files between compute nodes. While we strive to minimize disk\-based processing by making use of pipes, the pipeline still has a major IO component. To help manage IO and network bottlenecks, this section contains pointers on deployments and benchmarking. Please contribute your tips and thoughts. .INDENT 0.0 .IP \(bu 2 Harvard and Dell: See the \(aqDistributed File Systems\(aq section of our \fI\%post on scaling bcbio\-nextgen\fP for details about the setup within \fI\%Harvard FAS Research Computing\fP and thoughts on scaling and hardware. We also collaborate with Dell to test the pipeline on \fI\%Dell\(aqs Active Infrastructure for Life Sciences\fP\&. We found the biggest initial factor limiting scaling was network bandwidth between compute and storage nodes. .UNINDENT .sp bcbio runs with \fI\%Common Workflow Language (CWL)\fP compatible parallelization software. bcbio generates a CWL workflow from a \fI\%standard bcbio sample YAML description file\fP and any tool that supports CWL input can run the workflow. CWL\-based tools do the work of managing files and workflows, and bcbio performs the biological analysis using either a Docker container or a local installation. .SH CURRENT STATUS .sp bcbio creates CWL for alignment, small variant calls (SNPs and indels), coverage assessment, HLA typing, quality control and structural variant calling. It generates a \fI\%CWL v1.0.2\fP compatible workflow. The actual biological code execution during runs works with either a \fI\%bcbio docker container\fP or a \fI\%local installation of bcbio\fP\&. .sp The implementation includes bcbio\(aqs approaches to splitting and batching analyses. At the top level workflow, we parallelize by samples. Using sub\-workflows, we split fastq inputs into sections for parallel alignment over multiple machines, followed by merging. We also use sub\-workflows, along with CWL records, to batch multiple samples and run in parallel. This enables pooled and tumor/normal cancer calling with parallelization by chromosome regions based on coverage calculations. .INDENT 0.0 .INDENT 2.5 [image: Variant calling overview] [image] .UNINDENT .UNINDENT .sp bcbio supports these CWL\-compatible tools: .INDENT 0.0 .IP \(bu 2 \fI\%Cromwell\fP \-\- multicore local runs and distributed runs on HPC systems with shared filesystems and schedulers like SLURM, SGE and PBSPro. .IP \(bu 2 \fI\%Arvados\fP \-\- a hosted platform that runs on top of parallel cloud environments.
We include an example below of running on the \fI\%public Curoverse\fP instance hosted on \fI\%Microsoft Azure\fP\&. .IP \(bu 2 \fI\%DNANexus\fP \-\- a hosted platform running distributed jobs on cloud environments, working with both AWS and Azure. .IP \(bu 2 \fI\%rabix bunny\fP \-\- multicore local runs. .IP \(bu 2 \fI\%Toil\fP \-\- parallel local and distributed cluster runs on schedulers like SLURM, SGE and PBSPro. .IP \(bu 2 \fI\%Seven Bridges\fP \-\- parallel distributed analyses on the Seven Bridges platform and \fI\%Cancer Genomics Cloud\fP\&. .IP \(bu 2 \fI\%cwltool\fP \-\- a single core analysis engine, primarily used for testing. .UNINDENT .sp We plan to continue to expand CWL support to include more components of bcbio, and also need to evaluate the workflow on larger, real life analyses. This includes supporting additional CWL runners. We\(aqre working on evaluating \fI\%Galaxy/Planemo\fP for integration with the Galaxy community. .SH INSTALLATION .sp \fI\%bcbio\-vm\fP installs all dependencies required to generate CWL and run bcbio, along with supported CWL runners. There are two install choices, depending on your usage of bcbio: running CWL with an existing local bcbio install, or running with containers. .SS Install bcbio\-vm with a local bcbio .sp To run bcbio without using containers, first \fI\%install bcbio\fP and make it available in your path. You\(aqll need both the bcbio code and tools. To only run the tests and bcbio validations, you don\(aqt need a full data installation, so you can install with \fB\-\-nodata\fP\&. .sp To then install bcbio\-vm, add the \fB\-\-cwl\fP flag to the install: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_nextgen.py upgrade \-\-cwl
.ft P .fi .UNINDENT .UNINDENT .sp Adding this to any future upgrades will also update the bcbio\-vm wrapper code and tools. .sp When you begin running your own analysis and need the data available, pre\-prepare your bcbio data directory with \fBbcbio_nextgen.py upgrade \-\-data \-\-cwl\fP\&. .SS Install bcbio\-vm with containers .sp If you don\(aqt have an existing local bcbio installation and want to run with CWL using the tools and data embedded in containers, you can do a standalone install of just bcbio\-vm. To install using \fI\%Miniconda\fP and \fI\%bioconda packages\fP on Linux: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
wget http://repo.continuum.io/miniconda/Miniconda2\-latest\-Linux\-x86_64.sh
export TARGETDIR=~/install/bcbio\-vm/anaconda
export BINDIR=/usr/local
bash Miniconda2\-latest\-Linux\-x86_64.sh \-b \-p $TARGETDIR
$TARGETDIR/bin/conda install \-\-yes \-c conda\-forge \-c bioconda bcbio\-nextgen
$TARGETDIR/bin/conda install \-\-yes \-c conda\-forge \-c bioconda bcbio\-nextgen\-vm
ln \-s $TARGETDIR/bin/bcbio_vm.py $BINDIR/bcbio_vm.py
ln \-s $TARGETDIR/bin/conda $BINDIR/bcbiovm_conda
ln \-s $TARGETDIR/bin/python $BINDIR/bcbiovm_python
.ft P .fi .UNINDENT .UNINDENT .sp In the above commands, the \fIbcbio\-vm\fP install goes in \fB$TARGETDIR\fP\&. The example puts it in your home directory, but you can set it to anywhere you have space. Also, as an alternative to symbolic linking to a \fB$BINDIR\fP, you can add the install bin directory to your PATH: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
export PATH=$TARGETDIR/bin:$PATH
.ft P .fi .UNINDENT .UNINDENT .sp This install includes bcbio\-nextgen libraries, used in generating CWL and orchestrating runs, but is not a full bcbio installation.
It requires \fI\%Docker\fP present on your system; this is all you need to get started running examples, since the CWL runners will pull in Docker containers with the bcbio tools. .SH GETTING STARTED .sp To make it easy to get started, we have pre\-built CWL descriptions that use test data. These run in under 5 minutes on a local machine and don\(aqt require a bcbio installation if you have Docker available on your machine: .INDENT 0.0 .IP 1. 3 Download and unpack the \fI\%test repository\fP: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
wget \-O test_bcbio_cwl.tar.gz https://github.com/bcbio/test_bcbio_cwl/archive/master.tar.gz
tar \-xzvpf test_bcbio_cwl.tar.gz
cd test_bcbio_cwl\-master/somatic
.ft P .fi .UNINDENT .UNINDENT .IP 2. 3 Run the analysis using either Cromwell, Rabix bunny or Toil. If you have Docker available on your machine, the runner will download the correct \fI\%bcbio container\fP and you don\(aqt need to install anything else to get started. If you have an old version of the container, you can update to the latest with \fBdocker pull quay.io/bcbio/bcbio\-vc\fP\&. There are shell scripts that provide the command lines for running: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
bash run_cromwell.sh
bash run_bunny.sh
bash run_toil.sh
.ft P .fi .UNINDENT .UNINDENT .sp Or you can run directly using the \fBbcbio_vm.py\fP wrappers: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py cwlrun cromwell somatic\-workflow
bcbio_vm.py cwlrun toil somatic\-workflow
bcbio_vm.py cwlrun bunny somatic\-workflow
.ft P .fi .UNINDENT .UNINDENT .sp These wrappers automatically handle temporary directories, permissions, logging and re\-starts. If running without Docker, use a \fI\%local installation of bcbio\fP and add \fB\-\-no\-container\fP to the commands in the shell scripts. .UNINDENT .SH GENERATING CWL FOR INPUT TO A TOOL .sp The first step in running your analysis project in bcbio is to generate CWL. If you\(aqre already familiar with bcbio, the \fI\%process of preparing information about your sample inputs and analysis\fP is almost identical: .INDENT 0.0 .IP \(bu 2 A \fI\%standard bcbio sample configuration file\fP defining the samples. This can either be a full prepared YAML file or a \fI\%template file and CSV with sample data\fP\&. .IP \(bu 2 A \fBbcbio_system.yaml\fP file defining the system environment for running the program. This includes the resource specification with \fI\%cores and memory per core for your machines\fP\&. For choosing cores and memory per core, you generally want to set this to match the parameters of a single machine either for a local run or on a cluster. .sp In addition to \fI\%resources\fP specifications, the bcbio system file now also includes paths to the reference biodata and optionally input file directories if you want to avoid specifying full paths to your inputs in the \fBbcbio_vm.py template\fP command. bcbio will recursively look up file locations within those \fBinputs\fP, and this has the advantage of working identically for non\-local file locations.
Here is an example for a 16 core machine with 3.5Gb of memory per core: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
local:
  ref: /path/to/bcbio/genomes/Hsapiens
  inputs:
    \- /path/to/input/files
resources:
  default:
    cores: 16
    memory: 3500M
    jvm_opts: [\-Xms1g, \-Xmx3500m]
.ft P .fi .UNINDENT .UNINDENT .UNINDENT .sp Generate CWL with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py template \-\-systemconfig bcbio_system.yaml template.yaml samples.csv [optional list of fastq or BAM inputs]
bcbio_vm.py cwl \-\-systemconfig bcbio_system.yaml samples/config/samples.yaml
.ft P .fi .UNINDENT .UNINDENT .sp producing a \fBsample\-workflow\fP output directory with the CWL. .sp On a first CWL generation run with a new genome, this process will run for a longer time as it needs to make your reference compatible with CWL. This includes creating single tar.gz files from some reference directories so they can get passed to CWL steps where they\(aqll get unpacked. This process only happens a single time and keeps unpacked versions, so your reference setup is compatible with both old bcbio IPython and new CWL runs. .sp You can now run this with any CWL compatible runner, and the \fBbcbio_vm.py cwlrun\fP wrappers standardize running across multiple tools in different environments. .SH RUNNING WITH CROMWELL (LOCAL, HPC, CLOUD) .sp The \fI\%Cromwell\fP workflow management system runs bcbio either locally on a single machine or distributed on a cluster using a scheduler like SLURM, SGE or PBSPro. .sp To run a bcbio CWL workflow locally using Docker: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py cwlrun cromwell sample\-workflow
.ft P .fi .UNINDENT .UNINDENT .sp If you want to run from a locally installed bcbio, add \fB\-\-no\-container\fP to the commandline. .sp To run distributed on a SLURM cluster: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py cwlrun cromwell sample\-workflow \-\-no\-container \-q your_queue \-s slurm \-r timelimit=0\-12:00
.ft P .fi .UNINDENT .UNINDENT .sp Tweak scheduler parameters using the \fI\%same options as the older bcbio IPython approach\fP\&. .sp To control the resources used by Cromwell, set \fI\-\-joblimit\fP to the number of jobs allowed to be allocated concurrently. This isn\(aqt total cores used, but rather the number of jobs either locally or remotely scheduled concurrently. Since CWL steps are heterogeneous and use only the cores necessary for that job, the total cores used will max out at joblimit times the maximum cores for an individual process. Setting this helps avoid over\-committing jobs to a shared scheduler during highly parallel processes like variant calling. .sp Cromwell can also run directly on cloud resources; see the Google Cloud Platform documentation. .SH RUNNING WITH TOIL (LOCAL, HPC) .sp The \fI\%Toil pipeline management system\fP runs CWL workflows in parallel on a local machine, on a cluster or at AWS. .sp To run a bcbio CWL workflow locally with Toil using Docker: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py cwlrun toil sample\-workflow
.ft P .fi .UNINDENT .UNINDENT .sp If you want to run from a locally installed bcbio, add \fB\-\-no\-container\fP to the commandline. .sp To run distributed on a SLURM cluster: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py cwlrun toil sample\-workflow \-\- \-\-batchSystem slurm
.ft P .fi .UNINDENT .UNINDENT .SH RUNNING ON ARVADOS (HOSTED CLOUD) .sp bcbio generated CWL workflows run on \fI\%Arvados\fP, and these instructions detail how to run on the \fI\%Arvados public instance\fP\&. \fI\%Arvados cwl\-runner\fP comes pre\-installed with \fI\%bcbio\-vm\fP\&. We have a publicly accessible project, called \fI\%bcbio_resources\fP, that contains the latest Docker images, test data and genome references you can use for runs. .sp Retrieve API keys from the \fI\%Arvados public instance\fP\&. Login, then go to \fI\%\(aqUser Icon \-> Personal Token\(aq\fP\&. Copy and paste the commands given there into your shell. You\(aqll specifically need to set \fBARVADOS_API_HOST\fP and \fBARVADOS_API_TOKEN\fP\&.
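.sp For example (placeholder values; use the host and token shown on your Personal Token page): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
export ARVADOS_API_HOST=your_instance.arvadosapi.com
export ARVADOS_API_TOKEN=your_token_here
.ft P .fi .UNINDENT .UNINDENT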
.sp To run an analysis: .INDENT 0.0 .IP 1. 3 Create a new project from the web interface (Projects \-> Add a new project). Note the project ID from the URL of the project (an identifier like \fBqr1hi\-j7d0g\-7t73h4hrau3l063\fP). .IP 2. 3 Upload reference data to Arvados Keep. Note the genome collection UUID. You can also use the existing genomes pre\-installed in the \fBbcbio_resources\fP project if using the public Arvados playground: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
arv\-put \-\-name testdata_genomes \-\-project\-uuid $PROJECT_ID testdata/genomes/hg19
.ft P .fi .UNINDENT .UNINDENT .IP 3. 3 Upload input data to Arvados Keep. Note the collection UUID: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
arv\-put \-\-name testdata_inputs \-\-project\-uuid $PROJECT_ID testdata/100326_FC6107FAAXX testdata/automated testdata/reference_material
.ft P .fi .UNINDENT .UNINDENT .IP 4. 3 Create an Arvados section in a \fBbcbio_system.yaml\fP file specifying locations to look for reference and input data. \fBinput\fP can be one or more collections containing files or associated files in the original sample YAML: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
arvados:
  reference: qr1hi\-4zz18\-kuz1izsj3wkfisq
  input: [qr1hi\-j7d0g\-h691y6104tlg8b4]
resources:
  default: {cores: 4, memory: 2G, jvm_opts: [\-Xms750m, \-Xmx2500m]}
.ft P .fi .UNINDENT .UNINDENT .IP 5. 3 Generate the CWL to run your samples. If you\(aqre using multiple input files with a \fI\%CSV metadata file and template\fP, start with creation of a configuration file: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py template \-\-systemconfig bcbio_system_arvados.yaml testcwl_template.yaml testcwl.csv
.ft P .fi .UNINDENT .UNINDENT .sp To generate the CWL from the system and sample configuration files: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py cwl \-\-systemconfig bcbio_system_arvados.yaml testcwl/config/testcwl.yaml
.ft P .fi .UNINDENT .UNINDENT .IP 6. 3 In most cases, Arvados should directly pick up the Docker images you need from the public bcbio_resources project in your instance. If you need to manually add them to your project, you can copy the latest bcbio Docker image into your project from bcbio_resources using \fI\%arv\-copy\fP\&. You\(aqll need to find the UUID of \fBquay.io/bcbio/bcbio\-vc\fP and \fBarvados/jobs\fP: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
arv\-copy $JOBS_ID \-\-project\-uuid $PROJECT_ID \-\-src qr1hi \-\-dst qr1hi
arv\-copy $BCBIO_VC_ID \-\-project\-uuid $PROJECT_ID \-\-src qr1hi \-\-dst qr1hi
.ft P .fi .UNINDENT .UNINDENT .sp or import local Docker images to your Arvados project: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
docker pull arvados/jobs:1.0.20180216164101
arv\-keepdocker \-\-project $PROJECT_ID \-\- arvados/jobs 1.0.20180216164101
docker pull quay.io/bcbio/bcbio\-vc
arv\-keepdocker \-\-project $PROJECT_ID \-\- quay.io/bcbio/bcbio\-vc latest
.ft P .fi .UNINDENT .UNINDENT .IP 7.
3 Run the CWL on the Arvados public cloud using the Arvados cwl\-runner: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py cwlrun arvados arvados_testcwl\-workflow \-\- \-\-project\-uuid $PROJECT_ID
.ft P .fi .UNINDENT .UNINDENT .UNINDENT .SH RUNNING ON DNANEXUS (HOSTED CLOUD) .sp bcbio runs on the \fI\%DNAnexus platform\fP by converting bcbio generated CWL into DNAnexus workflows and apps using \fI\%dx\-cwl\fP\&. This describes the process using the bcbio workflow app (bcbio\-run\-workflow) and \fI\%bcbio workflow applet (bcbio_resources:/applets/bcbio\-run\-workflow)\fP in the public \fI\%bcbio_resources\fP project. Both are \fI\%regularly updated and maintained on the DNAnexus platform\fP\&. Secondarily, we also show how to install and create workflows locally for additional control and debugging. .INDENT 0.0 .IP 0. 4 Set some useful environmental variables: .INDENT 4.0 .IP \(bu 2 \fB$PNAME\fP \-\- The name of the project you\(aqre analyzing. For convenience here we keep this the same for your local files and remote DNAnexus project, although that does not have to be true. .IP \(bu 2 \fB$DX_AUTH_TOKEN\fP \-\- The DNAnexus authorization token for access, used for the \fBdx\fP command line tool and bcbio scripts. .IP \(bu 2 \fB$DX_PROJECT_ID\fP \-\- The DNAnexus GUID identifier for your project (similar to \fBproject\-F8Q7fJj0XFJJ3XbBPQYXP4B9\fP). You can get this from \fBdx env\fP after creating/selecting a project in steps 1 and 2. .UNINDENT .IP 1. 4 Create an analysis project: .INDENT 4.0 .INDENT 3.5 .sp .nf .ft C
dx new project $PNAME
.ft P .fi .UNINDENT .UNINDENT .IP 2. 4 Upload sample data to the project: .INDENT 4.0 .INDENT 3.5 .sp .nf .ft C
dx select $PNAME
dx upload \-p \-\-path /data/input *.bam
.ft P .fi .UNINDENT .UNINDENT .IP 3. 4 Create a bcbio system YAML file with projects, locations of files and desired core and memory usage for jobs. bcbio uses the core and memory specifications to determine machine instance types to use: .INDENT 4.0 .INDENT 3.5 .sp .nf .ft C
dnanexus:
  project: PNAME
  ref:
    project: bcbio_resources
    folder: /reference_genomes
  inputs:
    \- /data/input
    \- /data/input/regions
resources:
  default: {cores: 8, memory: 3000M, jvm_opts: [\-Xms1g, \-Xmx3000m]}
.ft P .fi .UNINDENT .UNINDENT .IP 4. 4 Create a bcbio sample CSV file referencing samples to run. The files can be relative to the \fBinputs\fP directory specified above; bcbio will search recursively for files, so you don\(aqt need to specify full paths if your file names are unique. Start with a sample specification: .INDENT 4.0 .INDENT 3.5 .sp .nf .ft C
samplename,description,batch,phenotype
file1.bam,sample1,b1,tumor
file2.bam,sample2,b1,normal
file3.bam,sample3,b2,tumor
file4.bam,sample4,b2,normal
.ft P .fi .UNINDENT .UNINDENT .IP 5. 4 Pick a template file that describes the \fI\%bcbio configuration\fP variables. You can define parameters either globally (in the template file) or by sample (in the CSV) using the \fI\%standard bcbio templating\fP\&. An example template for GATK4 germline variant calling is: .INDENT 4.0 .INDENT 3.5 .sp .nf .ft C
details:
  \- algorithm:
      aligner: bwa
      variantcaller: gatk\-haplotype
    analysis: variant2
    genome_build: hg38
.ft P .fi .UNINDENT .UNINDENT .IP 6. 4 Supply the three inputs (\fBbcbio_system.yaml\fP, \fBproject.csv\fP and \fBtemplate.yaml\fP) to either the bcbio\-run\-workflow app or applet. This example uses a specific version of the bcbio app for full reproducibility; any future re\-runs will always use the exact same versioned tools and workflows.
You can do this using the web interface or via the command line with a small script like: .INDENT 4.0 .INDENT 3.5 .sp .nf .ft C
TEMPLATE=germline
APP_VERSION=0.0.2
FOLDER=/bcbio/$PNAME
dx select $PNAME
dx mkdir \-p $FOLDER
for F in $TEMPLATE\-template.yaml $PNAME.csv bcbio_system\-dnanexus.yaml
do
    dx rm \-a /$FOLDER/$F || true
    dx upload \-\-path /$FOLDER/ $F
done
dx ls $FOLDER
dx rm \-a \-r /$FOLDER/dx\-cwl\-run || true
dx run bcbio\-run\-workflow/$APP_VERSION \e
    \-iyaml_template=/$FOLDER/$TEMPLATE\-template.yaml \e
    \-isample_spec=/$FOLDER/$PNAME.csv \e
    \-isystem_configuration=/$FOLDER/bcbio_system\-dnanexus.yaml \e
    \-ioutput_folder=/$FOLDER/dx\-cwl\-run
.ft P .fi .UNINDENT .UNINDENT .sp Alternatively, if you want the latest bcbio code, change the final command to use the applet. Everything else in the script is identical: .INDENT 4.0 .INDENT 3.5 .sp .nf .ft C
dx run bcbio_resources:/applets/bcbio\-run\-workflow \e
    \-iyaml_template=/$FOLDER/$TEMPLATE\-template.yaml \e
    \-isample_spec=/$FOLDER/$PNAME.csv \e
    \-isystem_configuration=/$FOLDER/bcbio_system\-dnanexus.yaml \e
    \-ioutput_folder=/$FOLDER/dx\-cwl\-run
.ft P .fi .UNINDENT .UNINDENT .UNINDENT .sp The app will look up all files, prepare a bcbio CWL workflow, convert it into a DNAnexus workflow, and submit it to the platform. The workflow runs as a standard DNAnexus workflow and you can monitor it through the command line (with \fBdx find executions \-\-root job\-YOURJOBID\fP and \fBdx watch\fP) or the web interface (\fBMonitor\fP tab). .sp If you prefer not to use the DNAnexus app, you can also submit jobs locally by installing \fI\%bcbio\-vm\fP on your local machine. This can also be useful to test generation of CWL and manually ensure identification of all your samples and associated files on the DNAnexus platform. .INDENT 0.0 .IP 1. 3 Follow the automated\-sample\-config workflow to generate a full configuration, and generate a CWL description of the workflow: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
TEMPLATE=germline
rm \-rf $PNAME $PNAME\-workflow
bcbio_vm.py template \-\-systemconfig bcbio_system\-dnanexus.yaml $TEMPLATE\-template.yaml $PNAME.csv
bcbio_vm.py cwl \-\-systemconfig bcbio_system\-dnanexus.yaml $PNAME/config/$PNAME.yaml
.ft P .fi .UNINDENT .UNINDENT .IP 2. 3 Determine project information and login credentials. You\(aqll want to note the \fBAuth token used\fP and \fBCurrent workspace\fP project ID: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
dx env
.ft P .fi .UNINDENT .UNINDENT .IP 3. 3 Compile the CWL workflow into a DNAnexus workflow: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
dx\-cwl compile\-workflow $PNAME\-workflow/main\-$PNAME.cwl \e
    \-\-project $DX_PROJECT_ID \-\-token $DX_AUTH_TOKEN \e
    \-\-rootdir $FOLDER/dx\-cwl\-run
.ft P .fi .UNINDENT .UNINDENT .IP 4. 3 Upload sample information from the generated CWL and run the workflow: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
FOLDER=/bcbio/$PNAME
dx mkdir \-p $DX_PROJECT_ID:$FOLDER/$PNAME\-workflow
dx upload \-p \-\-path $DX_PROJECT_ID:$FOLDER/$PNAME\-workflow $PNAME\-workflow/main\-$PNAME\-samples.json
dx\-cwl run\-workflow $FOLDER/dx\-cwl\-run/main\-$PNAME/main\-$PNAME \e
    $FOLDER/$PNAME\-workflow/main\-$PNAME\-samples.json \e
    \-\-project $DX_PROJECT_ID \-\-token $DX_AUTH_TOKEN \e
    \-\-rootdir $FOLDER/dx\-cwl\-run
.ft P .fi .UNINDENT .UNINDENT .UNINDENT .SH DEVELOPMENT NOTES .sp bcbio generates a common workflow language description. Internally, bcbio represents the files and information related to processing as \fI\%a comprehensive dictionary\fP\&.
This world object describes the state of a run and associated files, and new processing steps update or add information to it. The world object is roughly equivalent to CWL\(aqs JSON\-based input object, but CWL enforces additional annotations to identify files and models new inputs/outputs at each step. The work in bcbio is to move from our laissez\-faire approach to the more structured CWL model. .sp The generated CWL workflow is in \fBrun_info\-cwl\-workflow\fP: .INDENT 0.0 .IP \(bu 2 \fBmain\-*.cwl\fP \-\- the top level CWL file describing the workflow steps .IP \(bu 2 \fBmain*\-samples.json\fP \-\- the flattened bcbio world structure represented as CWL inputs .IP \(bu 2 \fBwf\-*.cwl\fP \-\- CWL sub\-workflows, describing sample level parallel processing of a section of the workflow, with potential internal parallelization. .IP \(bu 2 \fBsteps/*.cwl\fP \-\- CWL descriptions of sections of code run inside bcbio. Each of these is a potential parallelization point, and together they make up the nodes in the workflow. .UNINDENT .sp To help with defining the outputs at each step, there is a \fBWorldWatcher\fP object that can output changed files and world dictionary objects between steps in the pipeline when running bcbio in the standard way. The \fI\%variant pipeline\fP has examples using it. This is useful when preparing the CWL definitions of inputs and outputs for new steps in the \fI\%bcbio CWL step definitions\fP\&. .SH TODO .INDENT 0.0 .IP \(bu 2 Support the full variant calling workflow with additional steps like ensemble calling, heterogeneity detection and disambiguation. .IP \(bu 2 Port RNA\-seq and small RNA workflows to CWL. .UNINDENT .sp bcbio has two approaches to running on cloud providers like \fI\%Amazon Web Services (AWS)\fP, \fI\%Google Cloud (GCP)\fP and \fI\%Microsoft Azure\fP\&. For smaller projects we use a \fI\%simplified ansible based approach\fP which automates spinning up single multicore machines for running either traditional or docs\-cwl bcbio runs. .sp For larger distributed projects, we\(aqre actively working on using docs\-cwl support with runners like \fI\%Cromwell\fP that directly interface and run on cloud services. We\(aqll document these approaches here as they\(aqre tested and available. .sp For getting started, the CWL docs\-cwl\-installation documentation describes how to install \fI\%bcbio\-vm\fP, which provides a wrapper around bcbio that automates interaction with cloud providers and \fI\%Docker\fP\&. \fBbcbio_vm.py\fP also cleans up the command line usage to make it more intuitive and provides a superset of the functionality available in \fBbcbio_nextgen.py\fP\&. .SH GOOGLE CLOUD (GCP) .sp Cromwell runs bcbio CWL pipelines on Google Cloud using the \fI\%Google Pipelines API\fP\&. .SS Setup .sp To set up a Google Compute environment, you\(aqll make use of the \fI\%Web based console\fP and \fI\%gcloud and gsutil from the Google Cloud SDK\fP, which provide command line interfaces to manage data in Google Storage and Google Compute instances. You can install them with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_conda install \-c conda\-forge \-c bioconda google\-cloud\-sdk
.ft P .fi .UNINDENT .UNINDENT .sp For authentication, you want to set up a \fI\%Google Cloud Platform service account\fP\&. The environmental variable \fBGOOGLE_APPLICATION_CREDENTIALS\fP identifies a \fI\%JSON file of credentials\fP,
which bcbio passes to Cromwell for authentication: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
gcloud auth login
gcloud projects create your\-project
gcloud iam service\-accounts create your\-service\-account
gcloud projects add\-iam\-policy\-binding your\-project \-\-member \e
    "serviceAccount:your\-service\-account@your\-project.iam.gserviceaccount.com" \-\-role "roles/owner"
gcloud iam service\-accounts keys create ~/.config/gcloud/your\-service\-account.json \e
    \-\-iam\-account your\-service\-account@your\-project.iam.gserviceaccount.com
export GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud/your\-service\-account.json
.ft P .fi .UNINDENT .UNINDENT .sp You\(aqll need a project for your run along with a Google Storage bucket for your data and run intermediates: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
gcloud config set project your\-project
gsutil mb gs://your\-project
.ft P .fi .UNINDENT .UNINDENT .sp Additional documentation for Cromwell: \fI\%Google Pipelines API\fP and \fI\%Google authentication\fP\&. .SS Data preparation .sp Cromwell can localize data present in Google Storage buckets as part of the run process and bcbio will translate the data present in these storage buckets into references for the CWL run inputs. .sp Upload your data with \fBgsutil\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
gsutil cp your_data.bam gs://your\-project/inputs/
.ft P .fi .UNINDENT .UNINDENT .sp Create a \fBbcbio_system\-gcp.yaml\fP input file for docs\-cwl\-generate: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
gs:
  ref: gs://bcbiodata/collections
  inputs:
    \- gs://your\-project/inputs
resources:
  default: {cores: 2, memory: 3G, jvm_opts: [\-Xms750m, \-Xmx3000m]}
.ft P .fi .UNINDENT .UNINDENT .sp Generate a Common Workflow Language representation: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py template \-\-systemconfig bcbio_system\-gcp.yaml ${TEMPLATE}\-template.yaml $PNAME.csv
bcbio_vm.py cwl \-\-systemconfig bcbio_system\-gcp.yaml $PNAME/config/$PNAME.yaml
.ft P .fi .UNINDENT .UNINDENT .SS Running .sp Run the CWL using Cromwell by specifying the project and root Google Storage bucket for intermediates: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py cwlrun cromwell $PNAME \-\-cloud\-project your\-project \e
    \-\-cloud\-root gs://your\-project/work_cromwell
.ft P .fi .UNINDENT .UNINDENT .SH AMAZON WEB SERVICES (OLD) .sp \fI\%Amazon Web Services (AWS)\fP provides a flexible cloud based environment for running analyses. Cloud approaches offer the ability to perform analyses at scale with no investment in local hardware. They also offer full programmatic control over the environment, allowing bcbio to automate the entire setup, run and teardown process. .sp bcbio\-vm uses \fI\%Elasticluster\fP to build a cluster on AWS with an encrypted NFS mounted drive and an optional Lustre shared filesystem. We\(aqre phasing out this approach to cloud support in bcbio and will be actively moving to Common Workflow Language based approaches. .SS Data preparation .sp The easiest way to organize AWS projects is using an analysis folder inside an \fI\%S3 bucket\fP\&. Create a bucket and folder for your analysis and upload fastq, BAM and, optionally, a region BED file. Bucket names should include only lowercase letters, numbers and hyphens (\fB\-\fP) to conform to \fI\%S3 bucket naming restrictions\fP and avoid issues with resolution of SSL keys. You can create buckets and upload files using the \fI\%AWS S3 web console\fP, \fI\%the AWS cli client\fP or specialized tools like \fI\%gof3r\fP\&.
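.sp For example, with the AWS CLI (the bucket and file names below are placeholders): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
# create the analysis bucket, then upload inputs into a project folder
aws s3 mb s3://your\-project
aws s3 cp sample_1.fastq.gz s3://your\-project/your\-analysis/
.ft P .fi .UNINDENT .UNINDENT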
.sp You will also need a template file describing the type of run to do and a CSV file mapping samples in the bucket to names and any other metadata. See the automated\-sample\-config docs for more details about these files. Also upload both of these files to S3. .sp With that in place, prepare and upload the final configuration to S3 with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py template s3://your\-project/your\-analysis/template.yaml s3://your\-project/your\-analysis/name.csv
.ft P .fi .UNINDENT .UNINDENT .sp This will find the input files in the \fBs3://your\-project/your\-analysis\fP bucket, associate fastq and BAM files with the right samples, and add any found BED files as \fBvariant_regions\fP in the configuration. It will then upload the final configuration back to S3 as \fBs3://your\-project/your\-analysis/name.yaml\fP, which you can run directly from a bcbio cluster on AWS. By default, bcbio will use the us\-east S3 region, but you can specify a different region in the s3 path to the metadata file: \fBs3://your\-project@eu\-central\-1/your\-analysis/name.csv\fP .sp We currently support human analysis with both the GRCh37 and hg19 genomes. We can also add additional genomes as needed by the community and generally welcome feedback and comments on reference data support. .SS Extra software .sp We\(aqre not able to automatically install some useful tools in pre\-built docker containers due to licensing restrictions. Variant calling with GATK requires a manual download from the \fI\%GATK download\fP site for academic users. Commercial users \fI\%need a license\fP for GATK and for somatic calling with muTect. To make these jars available, upload them to the S3 bucket in a \fBjars\fP directory. bcbio will automatically include the correct GATK and muTect directives during your run. Alternatively, you can also manually specify the path to the jars using a global \fBresources\fP section of your input sample YAML file: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
resources:
  gatk:
    jar: s3://bcbio\-syn3\-eval/jars/GenomeAnalysisTK.jar
.ft P .fi .UNINDENT .UNINDENT .sp As with sample YAML scripts, specify a different region with an \fB@\fP in the bucket name: \fBs3://your\-project@us\-west\-2/jars/GenomeAnalysisTK.jar\fP .SS AWS setup .sp The first time running bcbio on AWS you\(aqll need to set up permissions, VPCs and local configuration files. We provide commands to automate all these steps and, once finished, they can be reused for subsequent runs. To start you\(aqll need to have an account at Amazon and your Access Key ID and Secret Key ID from the \fI\%AWS security credentials page\fP\&. These can be \fI\%IAM credentials\fP instead of root credentials as long as they have administrator privileges. Make them available to bcbio using the standard environmental variables: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
.ft P .fi .UNINDENT .UNINDENT .sp With this in place, two commands set up your elasticluster and AWS environment to run a bcbio cluster.
The first creates public/private keys, a bcbio IAM user, and sets up an elasticluster config in \fB~/.bcbio/elasticluster/config\fP: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws iam \-\-region=us\-east\-1
.ft P .fi .UNINDENT .UNINDENT .sp The second configures a VPC to host bcbio: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws vpc \-\-region=us\-east\-1
.ft P .fi .UNINDENT .UNINDENT .sp The \fBaws vpc\fP command is idempotent and can run multiple times if you change or remove parts of the infrastructure. You can also rerun the \fBaws iam\fP command, but if you\(aqd like to generate a new elasticluster configuration file (\fB~/.bcbio/elasticluster/config\fP) add the recreate flag: \fBbcbio_vm.py aws iam \-\-recreate\fP\&. This generates a new set of IAM credentials and public/private keys. These are only stored in the \fB~/.bcbio\fP directory so you need to fully recreate them if you delete the old ones. .SS Running a cluster .sp Following this setup, you\(aqre ready to run a bcbio cluster on AWS. We start from a standard Ubuntu AMI, installing all software for bcbio and the cluster as part of the boot process. .sp To configure your cluster run: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws config edit
.ft P .fi .UNINDENT .UNINDENT .sp This dialog allows you to define the cluster size and machine resources you\(aqd like to use. The defaults only have small instances to prevent accidentally starting an \fI\%expensive run\fP\&. If you\(aqre planning a run with fewer than 32 cores, do not use a cluster and instead run directly on a single machine using one of the \fI\%large r3 or c3 instances\fP\&. .sp This script also sets the size of the \fI\%encrypted NFS\-mounted drive\fP, which you can use to store processing data when running across a distributed cluster. At scale, you can replace this with a Lustre shared filesystem. See below for details on launching and attaching a Lustre filesystem to a cluster. .sp To ensure everything is correctly configured, run: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws info
.ft P .fi .UNINDENT .UNINDENT .sp When happy with your setup, start the cluster with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws cluster start
.ft P .fi .UNINDENT .UNINDENT .sp The cluster will take five to ten minutes to start and be provisioned. If you encounter any intermittent failures, you can rerun the cluster configuration step with \fBbcbio_vm.py aws cluster setup\fP or the bcbio\-specific installation with \fBbcbio_vm.py aws cluster bootstrap\fP\&. .SS Running Lustre .sp Elasticluster mounts the \fB/encrypted\fP directory as a NFS share available across all of the worker machines. You can use this as a processing directory for smaller runs but for larger runs may need a scalable distributed file system. bcbio supports using \fI\%Intel Cloud Edition for Lustre (ICEL)\fP to set up a Lustre scratch filesystem on AWS. .INDENT 0.0 .IP \(bu 2 Subscribe to \fI\%ICEL in the Amazon Marketplace\fP\&. .IP \(bu 2 By default, the Lustre filesystem will be 2TB and will be accessible to all hosts in the VPC. Creation takes about ten minutes and can happen in parallel while elasticluster sets up the cluster. Start the stack: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws icel create
.ft P .fi .UNINDENT .UNINDENT .sp If you encounter intermittent failures when installing the collectl plugin, the Lustre server was still created successfully; you can rerun the Lustre configuration step with \fBbcbio_vm.py aws icel create \-\-setup\fP\&.
If the failure happened before the collectl plugin installation, stop the Lustre stack and try again. .IP \(bu 2 Once the ICEL stack and elasticluster cluster are both running, mount the filesystem on the cluster: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws icel mount
.ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 The cluster instances will reboot with the Lustre filesystem mounted. .UNINDENT .SS Running an analysis .sp To run the analysis, connect to the head node with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws cluster ssh
.ft P .fi .UNINDENT .UNINDENT .sp Create your project directory and link the global bcbio configuration file in there with: .INDENT 0.0 .IP \(bu 2 NFS file system (no Lustre): .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
mkdir /encrypted/your\-project
cd !$ && mkdir work && cd work
.ft P .fi .UNINDENT .UNINDENT .IP \(bu 2 Lustre file system: .INDENT 2.0 .INDENT 3.5 .sp .nf .ft C
sudo mkdir /scratch/cancer\-dream\-syn3\-exome
sudo chown ubuntu !$
cd !$ && mkdir work && cd work
.ft P .fi .UNINDENT .UNINDENT .UNINDENT .sp If you started a single machine, run with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py run \-n 8 s3://your\-project/your\-analysis/name.yaml
.ft P .fi .UNINDENT .UNINDENT .sp Where the \fB\-n\fP argument should be the number of cores on the machine. .sp To run on a full cluster: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py ipythonprep s3://your\-project/your\-analysis/name.yaml slurm cloud \-n 60
sbatch bcbio_submit.sh
.ft P .fi .UNINDENT .UNINDENT .sp Where 60 is the total number of cores to use across all the worker nodes. Of your total machine cores, allocate 2 for the base bcbio_vm script and IPython controller instances. The \fI\%SLURM workload manager\fP distributes jobs across your cluster on a queue called \fBcloud\fP\&. A \fBslurm\-PID.out\fP file in the work directory contains the current status of the job, and \fBsacct_std\fP provides the status of jobs on the cluster. If you are new to SLURM, here is a summary of useful \fI\%SLURM commands\fP\&. .sp On successful completion, bcbio uploads the results of the analysis back into your s3 bucket and folder as \fBs3://your\-project/your\-analysis/final\fP\&. You can now clean up the cluster and Lustre filesystem. .SS Graphing resource usage .sp AWS runs include automatic monitoring of resource usage with \fI\%collectl\fP\&. bcbio_vm uses collectl statistics to plot CPU, memory, disk and network usage during each step of a run. To prepare resource usage plots after finishing an analysis, first copy the \fBbcbio\-nextgen.log\fP file to your local computer. Either use \fBbcbio_vm.py elasticluster sftp bcbio\fP to copy from the work directory on AWS (\fB/encrypted/your\-project/work/log/bcbio\-nextgen.log\fP) or transfer it from the output S3 bucket (\fByour\-project/your\-analysis/final/DATE_your\-project/bcbio\-nextgen.log\fP). .sp If your run worked cleanly you can use the log input file directly. If you had failures and restarts, or would only like to graph part of the run, you can edit the timing steps. Run \fBgrep Timing bcbio\-nextgen.log > your\-run.txt\fP to get the timing steps only, then edit as desired. .sp Retrieve the collectl statistics from the AWS cluster and prepare the resource usage graphs with: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py graph bcbio\-nextgen.log
.ft P .fi .UNINDENT .UNINDENT .sp By default the collectl stats will be in \fBmonitoring/collectl\fP and plots in \fBmonitoring/graphs\fP based on the above log timeframe.
If you need to re\-run plots later after shutting the cluster down, you can use the \fInone\fP cluster flag by running \fBbcbio_vm.py graph bcbio\-nextgen.log \-\-cluster none\fP\&. .sp If you\(aqd like to run graphing from a local non\-AWS run, such as a local HPC cluster, run \fBbcbio_vm.py graph bcbio\-nextgen.log \-\-cluster local\fP instead. .sp For convenience, there\(aqs a "serialize" flag (\(aq\-s\(aq) that saves the dataframe used for plotting. To explore the data, extract specific datapoints or zoom in, you can deserialize the output like a Python pickle file: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
import gzip
import cPickle as pickle

with gzip.open("./monitoring/collectl_info.pickle.gz", "rb") as decomp:
    collectl_info = pickle.load(decomp)
data, hardware, steps = collectl_info[1][0], collectl_info[1][1], collectl_info[1][2]
.ft P .fi .UNINDENT .UNINDENT .sp You can then plot, slice and zoom the data in a Jupyter notebook using matplotlib or highcharts (\fI\%https://github.com/arnoutaertgeerts/python\-highcharts\fP). .sp In addition to plots, the \fI\%summarize_timing.py\fP utility script prepares a summary table of run times per step. .SS Shutting down .sp The bcbio Elasticluster and Lustre integration can spin up a lot of AWS resources. You\(aqll be paying for these by the hour so you want to clean them up when you finish running your analysis. To stop the cluster: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws cluster stop
.ft P .fi .UNINDENT .UNINDENT .sp To remove the Lustre stack: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_vm.py aws icel stop
.ft P .fi .UNINDENT .UNINDENT .sp Double check that all instances have been properly stopped by looking in the AWS console. .SS Manual configuration .sp Experienced \fI\%elasticluster\fP users can edit the configuration files themselves. bcbio provides a small wrapper that automatically reads and writes these configurations to avoid users needing to understand elasticluster internals, but all functionality is fully available. Edit your \fB~/.bcbio/elasticluster/config\fP file to change parameters. You can also see the \fI\%latest example configuration\fP in the bcbio\-vm GitHub repository for more details on the other available options. .sp bcbio\-nextgen runs in a temporary work directory which contains a number of processing intermediates. Pipeline completion extracts the final useful output files into a separate directory, specified by the upload\-configuration\&. This configuration allows upload to local directories, Galaxy, or Amazon S3. Once you have extracted and confirmed the output files, you can delete the temporary directory to save space. .SH COMMON FILES .sp The output directory contains sample specific output files labeled by sample name and a more general project directory. The sample directories contain all of the sample specific output files, while the project directory contains global files like project summaries or batched population level variant calls. See the teaching documentation for a full variant calling example with additional details about configuration setting and resulting output files. .SS Project directory .INDENT 0.0 .IP \(bu 2 \fBproject\-summary.yaml\fP \-\- Top level YAML format summary file with statistics on read alignments and duplications as well as analysis specific metrics. .IP \(bu 2 \fBprograms.txt\fP \-\- Program versions for bcbio\-nextgen and software run in the pipeline. This enables reproduction of analyses.
.IP \(bu 2 \fBmultiqc\fP \-\- Runs \fI\%MultiQC\fP to gather all QC metrics from different tools, such as cutadapt, featureCounts, samtools and STAR, into a single HTML report. .UNINDENT .SS Sample directories .INDENT 0.0 .IP \(bu 2 \fBSAMPLE/qc\fP \-\- Directory of quality control runs for the sample. These include charts and metrics for assessing quality of sequencing and analysis. .IP \(bu 2 \fBSAMPLE\-ready.bam\fP \-\- A prepared BAM file of the aligned reads. Depending on the analysis used, this may include trimmed, recalibrated and realigned reads following alignment. .UNINDENT .SH VARIANT CALLING .SS Project directory .INDENT 0.0 .IP \(bu 2 \fBgrading\-summary.csv\fP \-\- Grading details comparing each sample to a reference set of calls. This will only have information when providing a validation callset. .IP \(bu 2 \fBBATCH\-caller.vcf\fP \-\- Variants called for a population/batch of samples by a particular caller. .IP \(bu 2 \fBBATCH\-caller.db\fP \-\- A \fI\%GEMINI database\fP associating variant calls with a wide variety of third party annotations. This provides a queryable framework for assessing variant quality statistics. .UNINDENT .SS Sample directories .INDENT 0.0 .IP \(bu 2 \fBSAMPLE\-caller.vcf\fP \-\- Variant calls for an individual sample. .UNINDENT .SH RNA-SEQ .SS Project directory .INDENT 0.0 .IP \(bu 2 \fBannotated_combined.counts\fP \-\- featureCounts counts matrix with gene symbol as an extra column. .IP \(bu 2 \fBcombined.counts\fP \-\- featureCounts counts matrix. .IP \(bu 2 \fBcombined.dexseq\fP \-\- DEXseq counts matrix with exonID as first column. .IP \(bu 2 \fBcombined.gene.sf.tpm\fP \-\- Sailfish gene count matrix normalized to TPM. .IP \(bu 2 \fBcombined.isoform.sf.tpm\fP \-\- Sailfish transcript count matrix normalized to TPM. .IP \(bu 2 \fBcombined.sf\fP \-\- Sailfish raw output; all sample files are pasted one after another. .IP \(bu 2 \fBtx2gene.csv\fP \-\- Annotation file needed for DESeq2 to use Sailfish output. .UNINDENT .SS Sample directories .INDENT 0.0 .IP \(bu 2 \fBSAMPLE\-transcriptome.bam\fP \-\- BAM file aligned to the transcriptome. .IP \(bu 2 \fBSAMPLE\-ready.counts\fP \-\- featureCounts gene counts output. .IP \(bu 2 \fBsailfish\fP \-\- Sailfish output. .UNINDENT .SH SMALL RNA-SEQ .SS Project directory .INDENT 0.0 .IP \(bu 2 \fBcounts_mirna.tsv\fP \-\- miRBase miRNA count matrix. .IP \(bu 2 \fBcounts.tsv\fP \-\- miRBase isomiRs count matrix. The ID is made of 5 tags: miRNA name, SNPs, additions, trimming at 5\(aq and trimming at 3\(aq. See the detailed explanation of the \fI\%naming\fP\&. .IP \(bu 2 \fBcounts_mirna_novel.tsv\fP \-\- miRDeep2 miRNA count matrix. .IP \(bu 2 \fBcounts_novel.tsv\fP \-\- miRDeep2 isomiRs count matrix. See the counts.tsv explanation for more detail. .IP \(bu 2 \fBseqcluster\fP \-\- Output of the \fI\%seqcluster\fP tool. Inside this folder, counts.tsv contains the count matrix for all clusters found over the genome. .IP \(bu 2 \fBseqclusterViz\fP \-\- Input file for the interactive browser at \fI\%https://github.com/lpantano/seqclusterViz\fP .IP \(bu 2 \fBreport\fP \-\- Rmd template to help with downstream analysis like QC metrics, differential expression, and clustering. .UNINDENT .SS Sample directories .INDENT 0.0 .IP \(bu 2 \fBSAMPLE\-mirbase\-ready.counts\fP \-\- Counts for miRBase miRNAs. .IP \(bu 2 \fBSAMPLE\-novel\-ready\fP \-\- Counts for miRDeep2 novel miRNAs. .IP \(bu 2 \fBtRNA\fP \-\- Output for \fI\%tdrmapper\fP\&.
.UNINDENT .SH DOWNSTREAM ANALYSIS .sp This section collects useful scripts and tools to do downstream analysis of bcbio\-nextgen outputs. If you have pointers to useful tools, please add them to the documentation. .INDENT 0.0 .IP \(bu 2 \fI\%Calculate and plot coverage\fP with matplotlib, from Luca Beltrame. .IP \(bu 2 \fI\%Another way\fP to visualize coverage for targeted NGS (exome) experiments with bedtools and R, from Stephen Turner. .IP \(bu 2 Assess the efficiency of targeted enrichment sequencing with \fI\%ngscat\fP\&. .UNINDENT .sp This section provides useful concepts for getting started digging into the code and contributing new functionality. We welcome contributors and hope these notes help make it easier to get started. .SH DEVELOPMENT GOALS .sp During development we seek to maximize functionality and usefulness, while avoiding complexity. Since these goals are sometimes in conflict, it\(aqs useful to understand the design approaches: .INDENT 0.0 .IP \(bu 2 Support high level configurability but avoid exposing all program options. Since pipelines support a wide variety of tools, each with a large number of options, we try to define configuration variables at a high level based on biological intent and then translate these into best\-practice options for each tool. The goal is to avoid having an overwhelming number of input configuration options. .IP \(bu 2 Provide best\-practice pipelines that make recommended decisions for processing. Coupled with the goal of minimizing configuration parameters, this requires trust and discussion around algorithm choices. An example is bwa alignment, which uses \fBbwa aln\fP for reads shorter than 75bp and \fBbwa mem\fP for longer reads, based on recommendations from Heng Li. Our general goal is to encourage discussion and development of best\-practices to make it easy to do the right thing. .IP \(bu 2 Support extensive debugging output. In complex distributed systems, programs fail in unexpected ways even during production runs. We try to maximize logging to help identify and diagnose these types of unexpected problems. .IP \(bu 2 Avoid making mistakes. This results in being conservative about decisions like deleting file intermediates. Coupled with extensive logging, we trade off disk usage for making it maximally easy to restart and debug problems. If you\(aqd like to delete work or log directories automatically, we recommend doing this as part of your batch scripts wrapping bcbio\-nextgen. .IP \(bu 2 Strive for a clean, readable code base. We strive to make the code a secondary source of information after hand\-written docs. Practically, this means maximizing information content in source files while using in\-line documentation to clarify as needed. .IP \(bu 2 Focus on a functional coding style with minimal use of global mutable objects. This approach works well with distributed code and isolates debugging to individual functions rather than globally mutable state. .IP \(bu 2 Make sure your changes integrate correctly by running the test suite before submitting a pull request. The pipeline is automatically tested in \fI\%Travis\-CI\fP, and a red label will appear in the pull request if your changes cause any issues. .UNINDENT .SH OVERVIEW .sp The most useful modules inside \fBbcbio\fP, ordered by likely interest: .INDENT 0.0 .IP \(bu 2 \fBpipeline\fP \-\- Top level functionality that drives the analysis pipeline.
\fBmain.py\fP contains top level definitions of pipelines like variant calling and RNAseq, and is the best place to start understanding the overall organization of the code. .IP \(bu 2 \fBngsalign\fP \-\- Integration with aligners for high\-throughput sequencing data. We support individual aligners with their own separate modules. .IP \(bu 2 \fBvariation\fP \-\- Tools for variant calling. Individual variant calling and processing approaches each have their own submodules. .IP \(bu 2 \fBrnaseq\fP \-\- Run RNA\-seq pipelines, currently supporting TopHat/Cufflinks. .IP \(bu 2 \fBprovenance\fP \-\- Track third party software versions, command lines and program flow. Handle writing of debugging details. .IP \(bu 2 \fBdistributed\fP \-\- Handle distribution of programs across multiple cores, or across multiple machines using IPython. .IP \(bu 2 \fBworkflow\fP \-\- Provide high level tools to run customized analyses. They tie into specialized analyses or visual front ends to make running bcbio\-nextgen easier for specific common tasks. .IP \(bu 2 \fBbroad\fP \-\- Code to handle calling Broad tools like GATK and Picard, as well as other Java\-based programs. .UNINDENT .SH DEVELOPMENT INFRASTRUCTURE .sp bcbio\-nextgen uses GitHub for code development, and we welcome pull requests. GitHub makes it easy to establish custom forks of the code and contribute those back. The Biopython documentation has great information on \fI\%using git and GitHub\fP for a community developed project. In short, make a fork of the \fI\%bcbio code\fP by clicking the \fBFork\fP button in the upper right corner of the GitHub page, commit your changes to this custom fork and keep it up to date with the main bcbio repository as you develop. The GitHub help pages have detailed information on keeping your fork updated with the main GitHub repository (e.g. \fI\%https://help.github.com/articles/syncing\-a\-fork/\fP). After committing changes, click \fBNew Pull Request\fP from your fork when you\(aqd like to submit your changes for integration in bcbio. .sp For developing and testing changes locally, you can install directly into a bcbio\-nextgen installation. The automated bcbio\-nextgen installer creates an isolated Python environment using \fI\%Anaconda\fP\&. This will be a subdirectory of your installation root, like \fB/usr/local/share/bcbio\-nextgen/anaconda\fP\&. The installer also includes a \fBbcbio_python\fP executable target which is the python in this isolated anaconda directory. You generally will want to make changes to your local copy of the bcbio\-nextgen code and then install these into the code directory. To install from your bcbio\-nextgen source tree for testing do: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_python setup.py install
.ft P .fi .UNINDENT .UNINDENT .sp One tricky part that we don\(aqt yet know how to work around is that pip and standard \fBsetup.py install\fP have different ideas about how to write Python eggs. \fBsetup.py install\fP will create an isolated python egg directory like \fBbcbio_nextgen\-0.7.5a\-py2.7.egg\fP, while pip creates an egg pointing to a top level \fBbcbio\fP directory. Where this gets tricky is that the top level \fBbcbio\fP directory takes precedence. The best way to work around this problem is to manually remove the current pip installed bcbio\-nextgen code (\fBrm \-rf /path/to/anaconda/lib/python2.7/site\-packages/bcbio*\fP) before managing it manually with \fBbcbio_python setup.py install\fP\&.
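.sp Concretely, that workaround looks like the following (the Anaconda path is illustrative; adjust it to your install): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
# remove the pip\-installed copy so the setup.py egg takes precedence
rm \-rf /path/to/anaconda/lib/python2.7/site\-packages/bcbio*
# then install from your bcbio\-nextgen source tree
bcbio_python setup.py install
.ft P .fi .UNINDENT .UNINDENT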
.sp We\(aqd welcome tips about ways to force consistent installation across methods. .sp If you want to test with bcbio_nextgen code in a separate environment from your work directory, we recommend using the installer to install only the bcbio code into a separate directory: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
python bcbio_nextgen_install.py /path/to/testbcbio \-\-nodata \-\-isolate
.ft P .fi .UNINDENT .UNINDENT .sp Then add this directory to your \fBPATH\fP before your bcbio installation with the tools: \fBexport PATH=/path/to/testbcbio/anaconda/bin:$PATH\fP, or directly call the testing bcbio: \fB/path/to/testbcbio/anaconda/bin/bcbio_nextgen.py\fP\&. .SH BUILDING THE DOCUMENTATION LOCALLY .sp If you have added or modified this documentation, to build it locally and see how it looks, run: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
cd docs
make html
.ft P .fi .UNINDENT .UNINDENT .sp The documentation will be built under \fBdocs/_build/html\fP; open \fBindex.html\fP with your browser to load your local build. .sp If you want to use the same theme that Read The Docs uses, you can do so by installing \fBsphinx_rtd_theme\fP via \fBpip\fP\&. You will also need to add this in the \fBdocs/conf.py\fP file to use the theme only locally: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
import os

html_theme = \(aqdefault\(aq
on_rtd = os.environ.get(\(aqREADTHEDOCS\(aq, False)
if not on_rtd:  # only import and set the theme if we\(aqre building docs locally
    import sphinx_rtd_theme
    html_theme = \(aqsphinx_rtd_theme\(aq
    html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
.ft P .fi .UNINDENT .UNINDENT .SH ADDING TOOLS .SS Aligner .sp Write new aligners within their own submodule inside the \fBngsalign\fP directory. \fI\%bwa.py\fP is a good example to follow along with. There are two functions to implement, based on which type of alignment you\(aqd like to allow: .INDENT 0.0 .IP \(bu 2 \fBalign_bam\fP \-\- Performs alignment given an input BAM file. Expected to return a sorted BAM output file. .IP \(bu 2 \fBalign\fP \-\- Performs alignment given FASTQ inputs (gzipped or not). This is generally expected to implement a Unix\-pipe approach that minimizes intermediates and disk IO, returning a sorted BAM output file. For backwards compatibility, this can also return a text\-based SAM file. .UNINDENT .sp See the \fI\%names\fP section for more details on arguments. .sp Other required implementation details include: .INDENT 0.0 .IP \(bu 2 \fBgalaxy_loc_file\fP \-\- Provides the name of the \fI\%Galaxy loc file\fP used to identify locations of indexes for this aligner. The automated installer sets up these loc files automatically. .IP \(bu 2 \fBremap_index_fn\fP \-\- A function that remaps an index from the Galaxy location file into the exact one for this aligner. This is useful for tools which aren\(aqt supported by a Galaxy .loc file but which you can locate relative to another index. .UNINDENT .sp Once implemented, plug the aligner into the pipeline by defining it as a \fB_tool\fP in \fI\%bcbio/pipeline/alignment.py\fP\&. You can then use it as normal by specifying the name of the aligner in the \fIaligner\fP section of your configuration input. .SS Variant caller .sp New variant calling approaches live within their own module inside \fBbcbio/variation\fP\&. The \fI\%freebayes.py\fP implementation is a good example to follow for providing your own variant caller.
Implement a function to run variant calling on multiple BAMs in an input region that takes the following inputs: .INDENT 0.0 .IP \(bu 2 \fBalign_bams\fP \-\- A list of BAM files to call simultaneously. .IP \(bu 2 \fBitems\fP \-\- List of \fBdata\fP dictionaries associated with each of the samples in \fBalign_bams\fP\&. Enables customization of variant calling based on sample configuration inputs. See documentation on the \fI\%data\fP dictionary for all of the information contained inside each \fBdata\fP item. Having multiple configurations allows customization of sample\-specific variant calls using parameters supplied to sample\-configuration\&. .IP \(bu 2 \fBref_file\fP \-\- Fasta reference genome file. .IP \(bu 2 \fBassoc_files\fP \-\- Useful associated files for variant calling. This includes the DbSNP VCF file. It\(aqs a named tuple mapping to files specified in the configuration. \fI\%bcbio/pipeline/shared.py\fP has the available inputs. .IP \(bu 2 \fBregion\fP \-\- A tuple of (chromosome, start, end) specifying the region to call in. .IP \(bu 2 \fBout_file\fP \-\- The output file to write to. This should contain calls for all input samples in the supplied region. .UNINDENT .sp Once implemented, add the variant caller into the pipeline by updating \fBcaller_fns\fP in the \fBvariantcall_sample\fP function in \fI\%bcbio/variation/genotype.py\fP\&. You can use it by specifying it in the \fBvariantcaller\fP parameter of your sample configuration. .SH ADDING NEW ORGANISMS .sp While bcbio\-nextgen and supporting tools receive the most testing and development on human or human\-like diploid organisms, the algorithms are generic and we strive to support the wide diversity of organisms used in your research. We welcome contributors interested in setting up and maintaining support for their particular research organism, and this section defines the steps in integrating a new genome. We also welcome suggestions and implementations that improve this process. .sp Set up CloudBioLinux to automatically download and prepare the genome: .INDENT 0.0 .IP \(bu 2 Add the genome database key and organism name to the list of supported organisms in the CloudBioLinux configuration (\fI\%config/biodata.yaml\fP). .IP \(bu 2 Add download details to specify where to get the fasta genome files (\fI\%cloudbio/biodata/genomes.py\fP). CloudBioLinux supports common genome providers like UCSC and Ensembl directly. .UNINDENT .sp Add the organism to the supported installs within bcbio: .INDENT 0.0 .IP \(bu 2 This happens in two places: for the initial installer (\fI\%scripts/bcbio_nextgen_install.py\fP) and the updater (\fI\%bcbio/install.py\fP). .UNINDENT .sp Test installation of genomes by pointing to your local cloudbiolinux edits during a data installation: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
mkdir \-p tmpbcbio\-install
ln \-s ~/bio/cloudbiolinux tmpbcbio\-install
bcbio_nextgen.py upgrade \-\-data \-\-genomes DBKEY
.ft P .fi .UNINDENT .UNINDENT .sp Add configuration information to bcbio\-nextgen by creating a \fBconfig/genomes/DBKEY\-resources.yaml\fP file. Copy an existing minimal template like \fBcanFam3\fP and edit with pointers to snpEff and other genome resources (see the sketch after this section). The \fI\%VEP database directory\fP has Ensembl names. SnpEff has a command to list available databases: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
snpEff databases
.ft P .fi .UNINDENT .UNINDENT .sp Finally, send pull requests for CloudBioLinux and bcbio\-nextgen and we\(aqll happily integrate the new genome.
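.sp As an unofficial sketch only: the keys below mirror the \fBgenome_resources\fP structure shown later in this documentation, and the values are placeholders rather than a definitive schema, so copy a real template like \fBcanFam3\fP for actual work: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
# hypothetical config/genomes/DBKEY\-resources.yaml contents
version: 1
aliases:
  snpeff: DBKEY_snpeff_database_name
.ft P .fi .UNINDENT .UNINDENT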
.sp This will provide basic integration with bcbio and allow running a minimal pipeline with alignment and quality control. We also have utility scripts in CloudBioLinux to help with preparing dbSNP (\fI\%utils/prepare_dbsnp.py\fP) and RNA\-seq (\fI\%utils/prepare_tx_gff.py\fP) resources for some genomes. For instance, to prepare RNA\-seq transcripts for mm9: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio_python prepare_tx_gff.py \-\-genome\-dir /path/to/bcbio/genomes Mmusculus mm9
.ft P .fi .UNINDENT .UNINDENT .sp We are still working on ways to best include these as part of the standard build and install since they either require additional tools to run locally, or require preparing copies in S3 buckets. .SH STANDARD FUNCTION ARGUMENTS .SS names .sp This dictionary provides lane and other \fI\%BAM run group\fP naming information used to correctly build BAM files. We use the \fBrg\fP attribute as the ID within a BAM file: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
{\(aqlane\(aq: \(aq7_100326_FC6107FAAXX\(aq,
 \(aqpl\(aq: \(aqillumina\(aq,
 \(aqpu\(aq: \(aq7_100326_FC6107FAAXX\(aq,
 \(aqrg\(aq: \(aq7\(aq,
 \(aqsample\(aq: \(aqTest1\(aq}
.ft P .fi .UNINDENT .UNINDENT .SS data .sp The \fIdata\fP dictionary is a large dictionary representing processing, configuration and files associated with a sample. The standard workflow is to pass this dictionary between functions, updating it with associated files from the additional processing. Populating this dictionary only with standard types allows serialization to JSON for distributed processing. .sp The dictionary is dynamic throughout the workflow depending on the step, but some of the most useful key/values available throughout are: .INDENT 0.0 .IP \(bu 2 \fBconfig\fP \-\- Input configuration variables about how to process in the \fBalgorithm\fP section and locations of programs in the \fBresources\fP section. .IP \(bu 2 \fBdirs\fP \-\- Useful directories for building output files or retrieving inputs. .IP \(bu 2 \fBmetadata\fP \-\- Top level metadata associated with a sample, specified in the initial configuration. .IP \(bu 2 \fBgenome_resources\fP \-\- Naming aliases and associated files for the current genome build. Retrieved from organism\-specific configuration files (\fBbuildname\-resources.yaml\fP), this specifies the location of supplemental organism\-specific files like support files for variation and RNA\-seq analysis. .UNINDENT .sp It also contains information about the genome build, sample name and reference genome file throughout.
Here\(aqs an example of these inputs: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
{\(aqconfig\(aq: {\(aqalgorithm\(aq: {\(aqaligner\(aq: \(aqbwa\(aq,
                          \(aqcallable_regions\(aq: \(aqanalysis_blocks.bed\(aq,
                          \(aqcoverage_depth\(aq: \(aqlow\(aq,
                          \(aqcoverage_interval\(aq: \(aqregional\(aq,
                          \(aqmark_duplicates\(aq: \(aqsamtools\(aq,
                          \(aqnomap_split_size\(aq: 50,
                          \(aqnomap_split_targets\(aq: 20,
                          \(aqnum_cores\(aq: 1,
                          \(aqplatform\(aq: \(aqillumina\(aq,
                          \(aqquality_format\(aq: \(aqStandard\(aq,
                          \(aqrealign\(aq: \(aqgkno\(aq,
                          \(aqrecalibrate\(aq: \(aqgatk\(aq,
                          \(aqsave_diskspace\(aq: True,
                          \(aqupload_fastq\(aq: False,
                          \(aqvalidate\(aq: \(aq../reference_material/7_100326_FC6107FAAXX\-grade.vcf\(aq,
                          \(aqvariant_regions\(aq: \(aq../data/automated/variant_regions\-bam.bed\(aq,
                          \(aqvariantcaller\(aq: \(aqfreebayes\(aq},
            \(aqresources\(aq: {\(aqbcbio_variation\(aq: {\(aqdir\(aq: \(aq/usr/share/java/bcbio_variation\(aq},
                          \(aqbowtie\(aq: {\(aqcores\(aq: None},
                          \(aqbwa\(aq: {\(aqcores\(aq: 4},
                          \(aqcortex\(aq: {\(aqdir\(aq: \(aq~/install/CORTEX_release_v1.0.5.14\(aq},
                          \(aqcram\(aq: {\(aqdir\(aq: \(aq/usr/share/java/cram\(aq},
                          \(aqgatk\(aq: {\(aqcores\(aq: 2, \(aqdir\(aq: \(aq/usr/share/java/gatk\(aq,
                                   \(aqjvm_opts\(aq: [\(aq\-Xms750m\(aq, \(aq\-Xmx2000m\(aq],
                                   \(aqversion\(aq: \(aq2.4\-9\-g532efad\(aq},
                          \(aqgemini\(aq: {\(aqcores\(aq: 4},
                          \(aqnovoalign\(aq: {\(aqcores\(aq: 4, \(aqmemory\(aq: \(aq4G\(aq,
                                        \(aqoptions\(aq: [\(aq\-o\(aq, \(aqFullNW\(aq]},
                          \(aqpicard\(aq: {\(aqcores\(aq: 1, \(aqdir\(aq: \(aq/usr/share/java/picard\(aq},
                          \(aqsnpEff\(aq: {\(aqdir\(aq: \(aq/usr/share/java/snpeff\(aq,
                                     \(aqjvm_opts\(aq: [\(aq\-Xms750m\(aq, \(aq\-Xmx3g\(aq]},
                          \(aqstampy\(aq: {\(aqdir\(aq: \(aq~/install/stampy\-1.0.18\(aq},
                          \(aqtophat\(aq: {\(aqcores\(aq: None},
                          \(aqvarscan\(aq: {\(aqdir\(aq: \(aq/usr/share/java/varscan\(aq},
                          \(aqvcftools\(aq: {\(aqdir\(aq: \(aq~/install/vcftools_0.1.9\(aq}}},
 \(aqgenome_resources\(aq: {\(aqaliases\(aq: {\(aqensembl\(aq: \(aqhuman\(aq, \(aqhuman\(aq: True, \(aqsnpeff\(aq: \(aqhg19\(aq},
                      \(aqrnaseq\(aq: {\(aqtranscripts\(aq: \(aq/path/to/rnaseq/ref\-transcripts.gtf\(aq,
                                 \(aqtranscripts_mask\(aq: \(aq/path/to/rnaseq/ref\-transcripts\-mask.gtf\(aq},
                      \(aqvariation\(aq: {\(aqdbsnp\(aq: \(aq/path/to/variation/dbsnp_132.vcf\(aq,
                                    \(aqtrain_1000g_omni\(aq: \(aq/path/to/variation/1000G_omni2.5.vcf\(aq,
                                    \(aqtrain_hapmap\(aq: \(aq/path/to/hg19/variation/hapmap_3.3.vcf\(aq,
                                    \(aqtrain_indels\(aq: \(aq/path/to/variation/Mills_Devine_2hit.indels.vcf\(aq},
                      \(aqversion\(aq: 1},
 \(aqdirs\(aq: {\(aqfastq\(aq: \(aqinput fastq directory\(aq,
          \(aqgalaxy\(aq: \(aqdirectory with galaxy loc and other files\(aq,
          \(aqwork\(aq: \(aqbase work directory\(aq},
 \(aqmetadata\(aq: {\(aqbatch\(aq: \(aqTestBatch1\(aq},
 \(aqgenome_build\(aq: \(aqhg19\(aq,
 \(aqname\(aq: (\(aq\(aq, \(aqTest1\(aq),
 \(aqsam_ref\(aq: \(aq/path/to/hg19.fa\(aq}
.ft P .fi .UNINDENT .UNINDENT .sp Processing also injects other useful key/value pairs.
Here\(aqs an example of additional information supplied during a variant calling workflow: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
{\(aqprep_recal\(aq: \(aqTest1/7_100326_FC6107FAAXX\-sort.grp\(aq,
 \(aqsummary\(aq: {\(aqmetrics\(aq: [(\(aqReference organism\(aq, \(aqhg19\(aq, \(aq\(aq),
                         (\(aqTotal\(aq, \(aq39,172\(aq, \(aq76bp paired\(aq),
                         (\(aqAligned\(aq, \(aq39,161\(aq, \(aq(100.0\e\e%)\(aq),
                         (\(aqPairs aligned\(aq, \(aq39,150\(aq, \(aq(99.9\e\e%)\(aq),
                         (\(aqPair duplicates\(aq, \(aq0\(aq, \(aq(0.0\e\e%)\(aq),
                         (\(aqInsert size\(aq, \(aq152.2\(aq, \(aq+/\- 31.4\(aq)],
             \(aqpdf\(aq: \(aq7_100326_FC6107FAAXX\-sort\-prep\-summary.pdf\(aq,
             \(aqproject\(aq: \(aqproject\-summary.yaml\(aq},
 \(aqvalidate\(aq: {\(aqconcordant\(aq: \(aqTest1\-ref\-eval\-concordance.vcf\(aq,
              \(aqdiscordant\(aq: \(aqTest1\-eval\-ref\-discordance\-annotate.vcf\(aq,
              \(aqgrading\(aq: \(aqvalidate\-grading.yaml\(aq,
              \(aqsummary\(aq: \(aqvalidate\-summary.csv\(aq},
 \(aqvariants\(aq: [{\(aqpopulation\(aq: {\(aqdb\(aq: \(aqgemini/TestBatch1\-freebayes.db\(aq, \(aqvcf\(aq: None},
               \(aqvalidate\(aq: None,
               \(aqvariantcaller\(aq: \(aqfreebayes\(aq,
               \(aqvrn_file\(aq: \(aq7_100326_FC6107FAAXX\-sort\-variants\-gatkann\-filter\-effects.vcf\(aq}],
 \(aqvrn_file\(aq: \(aq7_100326_FC6107FAAXX\-sort\-variants\-gatkann\-filter\-effects.vcf\(aq,
 \(aqwork_bam\(aq: \(aq7_100326_FC6107FAAXX\-sort\-prep.bam\(aq}
.ft P .fi .UNINDENT .UNINDENT .SH PARALLELIZATION FRAMEWORK .sp bcbio\-nextgen supports parallel runs on local machines using multiple cores and distributed runs on a cluster using IPython, through a general framework. .sp The first parallelization step starts up a set of resources for processing. On a cluster this spawns an IPython parallel controller and set of engines for processing. The \fI\%prun (parallel run)\fP \fBstart\fP function is the entry point to spawning the cluster, and the main argument is a \fBparallel\fP dictionary which contains arguments to the engine processing command. Here is an example input from an IPython parallel run: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
{\(aqcores\(aq: 12,
 \(aqtype\(aq: \(aqipython\(aq,
 \(aqprogs\(aq: [\(aqaligner\(aq, \(aqgatk\(aq],
 \(aqensure_mem\(aq: {\(aqstar\(aq: 30, \(aqtophat\(aq: 8, \(aqtophat2\(aq: 8},
 \(aqmodule\(aq: \(aqbcbio.distributed\(aq,
 \(aqqueue\(aq: \(aqbatch\(aq,
 \(aqscheduler\(aq: \(aqtorque\(aq,
 \(aqresources\(aq: [],
 \(aqretries\(aq: 0,
 \(aqtag\(aq: \(aq\(aq,
 \(aqtimeout\(aq: 15}
.ft P .fi .UNINDENT .UNINDENT .sp The \fBcores\fP and \fBtype\fP arguments must be present, identifying the total cores to use and the type of processing, respectively. Following that are arguments to help identify the resources to use. \fBprogs\fP specifies the programs used, here the aligner, which bcbio looks up from the input sample file, and gatk. \fBensure_mem\fP is an optional argument that specifies minimum memory requirements for programs used in the workflow. The remaining arguments are all specific to IPython to help it spin up engines on the appropriate computing cluster. .sp A shared component of all processing runs is the identification of used programs from the \fBprogs\fP argument. The run creation process looks up required memory and CPU resources for each program from the config\-resources section of your \fBbcbio_system.yaml\fP file. It combines these resources into required memory and cores using the logic described in the memory\-management section of the parallel documentation.
Passing these requirements to the cluster creation process ensures the available machines match program requirements. .sp bcbio\-nextgen\(aqs \fI\%pipeline.main\fP code contains examples of starting and using a set of available processing engines. This example starts up machines that use samtools, gatk and cufflinks then runs an RNA\-seq expression analysis: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
with prun.start(_wprogs(parallel, ["samtools", "gatk", "cufflinks"]),
                samples, config, dirs, "rnaseqcount") as run_parallel:
    samples = rnaseq.estimate_expression(samples, run_parallel)
.ft P .fi .UNINDENT .UNINDENT .sp The pipelines often reuse a single set of machines for multiple distributed functions to avoid the overhead of starting up and tearing down machines and clusters. .sp The \fBrun_parallel\fP function returned from the \fBprun.start\fP function enables running jobs in parallel on the created machines. The \fI\%ipython wrapper\fP code contains examples of implementing this. It is a simple function that takes two arguments, the name of the function to run and a set of multiple arguments to pass to that function: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
def run(fn_name, items):
.ft P .fi .UNINDENT .UNINDENT .sp The \fBitems\fP arguments need to be strings, lists or dictionaries to allow serialization to JSON format. The internals of the run function take care of running all of the code in parallel and returning the results back to the caller function. .sp In this setup, the main processing code is fully independent from the parallel method used, so running on a single multicore machine or in parallel on a cluster returns identical results and requires no changes to the logical code defining the pipeline. .sp During re\-runs, we avoid the expense of spinning up processing clusters for completed tasks using simple checkpoint files in the \fBcheckpoints_parallel\fP directory. The \fBprun.start\fP wrapper writes these on completion of processing for a group of tasks with the same parallel architecture, and on subsequent runs will go through these on the local machine instead of parallelizing. The processing code supports these quick re\-runs by checking for and avoiding re\-running of tasks when it finds output files. .sp Plugging new parallelization approaches into this framework involves writing interface code that handles two steps. First, create a cluster of ready\-to\-run machines given the \fBparallel\fP function with expected core and memory utilization: .INDENT 0.0 .IP \(bu 2 \fBnum_jobs\fP \-\- Total number of machines to start. .IP \(bu 2 \fBcores_per_job\fP \-\- Number of cores available on each machine. .IP \(bu 2 \fBmem\fP \-\- Expected memory needed for each machine. Divide by \fBcores_per_job\fP to get the memory usage per core on a machine. .UNINDENT .sp Second, implement a \fBrun_parallel\fP function that handles using these resources to distribute jobs and return results. The \fI\%multicore wrapper\fP and \fI\%ipython wrapper\fP are useful starting points for understanding the current implementations. .SH OVERVIEW .INDENT 0.0 .INDENT 2.5 [image: Variant calling overview] [image] .UNINDENT .UNINDENT .SH PARALLEL .sp bcbio calculates callable regions following alignment using \fI\%goleft depth\fP\&. These regions determine breakpoints for analysis, allowing \fI\%parallelization by genomic regions\fP during variant calling.
.SH OVERVIEW .INDENT 0.0 .INDENT 2.5 [image: Variant calling overview] [image] .UNINDENT .UNINDENT .SH PARALLEL .sp bcbio calculates callable regions following alignment using \fI\%goleft depth\fP\&. These regions determine breakpoints for analysis, allowing \fI\%parallelization by genomic regions\fP during variant calling. Defining non\-callable regions allows bcbio to select breakpoints for parallelization within chromosomes at points where splitting won\(aqt accidentally impact small variant detection. The callable regions also supplement the variant calls to define positions where uncalled bases are homozygous reference, as opposed to true no\-calls with no supporting evidence. The combination of callable regions and variant calls is an alternative to gVCF output, which explicitly enumerates reference calls in the output variant file. The sketch following the figure below illustrates this breakpoint selection. .INDENT 0.0 .INDENT 2.5 [image: Parallel approaches] [image] Overview of cluster types during parallel execution. .UNINDENT .UNINDENT
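.sp As a concrete illustration, this sketch groups callable intervals into parallel blocks, placing block boundaries only in the non\-callable gaps between intervals. The \fBparallel_blocks\fP function and the chunk\-size heuristic are illustrative assumptions, not bcbio\(aqs actual algorithm: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
def parallel_blocks(callable_regions, target_size=50_000_000):
    """Group sorted (chrom, start, end) intervals into blocks.

    Blocks close only between callable regions, so no boundary
    falls where a small variant could be split across jobs.
    """
    blocks, current, size = [], [], 0
    for chrom, start, end in callable_regions:
        if current and (size >= target_size or current[\-1][0] != chrom):
            blocks.append(current)  # close at a non\-callable gap
            current, size = [], 0
        current.append((chrom, start, end))
        size += end \- start
    if current:
        blocks.append(current)
    return blocks

regions = [("chr1", 0, 30_000_000), ("chr1", 30_500_000, 60_000_000),
           ("chr2", 0, 40_000_000)]
print(len(parallel_blocks(regions)))  # \-> 2: chr1 together, chr2 alone
.ft P .fi .UNINDENT .UNINDENT .sp Closing blocks only at gaps between callable regions preserves the guarantee above: no small variant can span a block boundary.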
.SH PRESENTATIONS .INDENT 0.0 .IP \(bu 2 Variant calling and bcbio training for the \fI\%Harvard Chan Bioinformatics Core In Depth NGS Data Analysis Course\fP (10 October 2018): \fI\%slides\fP .IP \(bu 2 Building a diverse set of validations; lightning talk at \fI\%the GCCBOSC2018 Bioinformatics Community Conference\fP: \fI\%slides\fP .IP \(bu 2 bcbio training at \fI\%the GCCBOSC2018 Bioinformatics Community Conference\fP, focusing on bcbio CWL integration with examples of variant calling analyses on Personal Genome Project examples (26 June 2018): \fI\%slides\fP; \fI\%video\fP .IP \(bu 2 Description of bcbio and Common Workflow integration with a focus on parallelization strategies. From a bcbio discussion with \fI\%Peter Park\(aqs lab at Harvard Medical School\fP (26 January 2018): \fI\%slides\fP .IP \(bu 2 In depth description of bcbio and Common Workflow Language integration, including motivation and practical examples of running on clusters, DNAnexus, SevenBridges and Arvados. From the \fI\%Boston Bioinformatics Interest Group meeting\fP (2 November 2017): \fI\%slides\fP; \fI\%video\fP .IP \(bu 2 bcbio practical interoperability with the Common Workflow Language at \fI\%BOSC 2017\fP (22 July 2017): \fI\%slides\fP; \fI\%video\fP .IP \(bu 2 Teaching variant calling, bcbio and GATK4 validation at the \fI\%Summer 2017 NGS Data Analysis Course at Harvard Chan School\fP (6 July 2017): \fI\%slides\fP .IP \(bu 2 Training course for the \fI\%Cancer Genomics Cloud\fP, describing how bcbio uses the Common Workflow Language to run in multiple infrastructures (1 May 2017): \fI\%slides\fP .IP \(bu 2 \fI\%MIT Bioinformatics Interest Group\fP talk about how the Common Workflow Language \fI\%enables interoperability with multiple workflow engines\fP (3 November 2016): \fI\%slides\fP and \fI\%video\fP .IP \(bu 2 \fI\%Broad Institute\fP software engineering seminar about bcbio validation and integration with the Common Workflow Language and the Workflow Definition Language (28 September 2016): \fI\%slides\fP .IP \(bu 2 Materials from teaching at the \fI\%Summer 2016 NGS Data Analysis Course at Harvard Chan School\fP (11 August 2016): \fI\%slides\fP .IP \(bu 2 \fI\%Bioinformatics Open Source Conference (BOSC) 2016\fP lightning talk on bcbio and the Common Workflow Language (8 July 2016): \fI\%slides\fP and \fI\%video\fP\&. .IP \(bu 2 Materials from teaching at the \fI\%Spring 2016 NGS Data Analysis Course at Harvard Chan School\fP (28 April 2016): \fI\%slides\fP .IP \(bu 2 Statistical Genetics and Network Science Meeting at the \fI\%Channing Division of Network Medicine\fP (23 March 2016): \fI\%slides\fP .IP \(bu 2 Presentation at the \fI\%Curoverse\fP Brown Bag Seminar on bcbio and in\-progress integration work with the \fI\%Common Workflow Language\fP and \fI\%Arvados\fP (11 January 2016): \fI\%slides\fP .IP \(bu 2 Materials from a teaching\-oriented example at Cold Spring Harbor Laboratory\(aqs \fI\%Advanced Sequencing Technology and Applications course\fP (18 November 2015): \fI\%slides\fP .IP \(bu 2 Supporting the Common Workflow Language and Docker in bcbio, at the \fI\%Bio in Docker symposium\fP (9 November 2015): \fI\%slides\fP .IP \(bu 2 Validation on human build 38, HLA typing, low frequency cancer calling and structural variation for the \fI\%Boston Bioinformatics Interest Group (BIG) meeting\fP (5 November 2015): \fI\%slides\fP .IP \(bu 2 Presentation on Research Scientist Careers for the \fI\%Iowa State Bioinformatics Course\fP (23 September 2015): \fI\%slides\fP .IP \(bu 2 Prioritization of structural variants based on known biological information at \fI\%BOSC 2015\fP (10 July 2015): \fI\%slides\fP; \fI\%video\fP .IP \(bu 2 Overview of variant calling for the \fI\%NGS Data Analysis Course at Harvard Medical School\fP (19 May 2015): \fI\%slides\fP .IP \(bu 2 \fI\%NGS Glasgow\fP (23 April 2015): \fI\%slides\fP .IP \(bu 2 \fI\%Boston Computational Biology and Bioinformatics meetup\fP (1 April 2015): \fI\%slides\fP .IP \(bu 2 \fI\%Program in Genetic Epidemiology and Statistical Genetics seminar series\fP at Harvard Chan School (6 February 2015): \fI\%slides\fP .IP \(bu 2 Talk at \fI\%Good Start Genetics\fP (23 January 2015): \fI\%slides\fP .IP \(bu 2 Boston area \fI\%Bioinformatics Interest Group\fP (15 October 2014): \fI\%slides\fP .IP \(bu 2 University of Georgia \fI\%Institute of Bioinformatics\fP (12 September 2014): \fI\%slides\fP .IP \(bu 2 Intel Life Sciences discussion (7 August 2014): \fI\%slides\fP .IP \(bu 2 Bioinformatics Open Source Conference (BOSC) 2014: \fI\%slides\fP, \fI\%conference website\fP .IP \(bu 2 Galaxy Community Conference 2014: \fI\%slides\fP, \fI\%conference website\fP .IP \(bu 2 \fI\%bcbio hackathon at Biogen\fP (3 June 2014) .IP \(bu 2 \fI\%Harvard ABCD group slides\fP (17 April 2014) .IP \(bu 2 \fI\%BIG meeting\fP (February 2014) .IP \(bu 2 \fI\%Novartis slides\fP (21 January 2014) .IP \(bu 2 Mt Sinai: Strategies for accelerating the genomic sequencing pipeline: \fI\%Mt Sinai workshop slides\fP, \fI\%Mt Sinai workshop website\fP .IP \(bu 2 Genome Informatics 2013: \fI\%GI 2013 Presentation slides\fP .IP \(bu 2 Bioinformatics Open Source Conference 2013: \fI\%BOSC 2013 Slides\fP, \fI\%BOSC 2013 Video\fP, \fI\%BOSC 2013 Conference website\fP .IP \(bu 2 Arvados Summit 2013: \fI\%Arvados Summit Slides\fP, \fI\%Arvados Summit website\fP .IP \(bu 2 Scientific Python 2013: \fI\%SciPy 2013 Video\fP, \fI\%SciPy 2013 Conference website\fP .UNINDENT .sp Feel free to reuse any images or text from these talks. The \fI\%slides are on GitHub\fP\&. .SH ABSTRACT .sp \fBCommunity Development of Validated Variant Calling Pipelines\fP .sp \fIBrad Chapman, Rory Kirchner, Oliver Hofmann and Winston Hide; Harvard School of Public Health, Bioinformatics Core, Boston, MA 02115\fP .sp Translational research relies on accurate identification of genomic variants.
However, rapidly changing best practice approaches in alignment and variant calling, coupled with large data sizes, make it a challenge to create reliable and reproducible variant calls. Coordinated community development can help overcome these challenges by sharing testing and updates across multiple groups. We describe bcbio\-nextgen, a distributed multi\-architecture pipeline that automates variant calling, validation and organization of results for query and visualization. It creates an easily installable, reliable infrastructure from best\-practice open source tools with the following goals: .INDENT 0.0 .IP \(bu 2 \fBQuantifiable:\fP Validates variant calls against known reference materials developed by the \fI\%Genome in a Bottle\fP consortium. The \fI\%bcbio.variation\fP toolkit automates scoring and assessment of calls to identify regressions in variant identification as calling pipelines evolve. Incorporation of multiple variant calling approaches from \fI\%Broad\(aqs GATK best practices\fP and the \fI\%Marth lab\(aqs gkno software\fP enables informed comparisons between current and future algorithms. .IP \(bu 2 \fBScalable:\fP bcbio\-nextgen handles large population studies with hundreds of whole genome samples by parallelizing on a wide variety of schedulers and multicore machines, setting up different ad hoc cluster configurations for each workflow step. Work in progress includes integration with virtualized environments, including \fI\%Amazon Web Services\fP and \fI\%OpenStack\fP\&. .IP \(bu 2 \fBAccessible:\fP Results automatically feed into tools for query and investigation of variants. The \fI\%GEMINI framework\fP provides a queryable database associating variants with a wide variety of genome annotations. The \fI\%o8\fP web\-based tool visualizes the work of variant prioritization and assessment. .IP \(bu 2 \fBCommunity developed:\fP bcbio\-nextgen is widely used in multiple sequencing centers and research laboratories. We actively encourage contributors to the code base and make it easy to get started with a fully automated installer and updater that prepares all third party software and reference genomes. .UNINDENT .SH LINKS FROM THE PRESENTATION .INDENT 0.0 .IP \(bu 2 \fI\%HugeSeq\fP .IP \(bu 2 \fI\%Genome Comparison & Analytic Testing\fP at Bioplanet .IP \(bu 2 \fI\%Peter Block’s “Community” book\fP .IP \(bu 2 \fI\%CloudBioLinux\fP and \fI\%Homebrew Science\fP as installation frameworks; \fI\%Conda\fP as the Python environment .IP \(bu 2 bcbio \fI\%documentation\fP at ReadTheDocs .IP \(bu 2 \fI\%Arvados framework\fP for metadata tracking, NGS processing and data provenance .IP \(bu 2 Notes on \fI\%improved scaling for NGS workflows\fP .IP \(bu 2 Genomic Reference Materials from \fI\%Genome in a Bottle\fP .IP \(bu 2 Comparison of \fI\%aligners and callers\fP using NIST reference materials .IP \(bu 2 Callers and \fI\%minimal BAM preparation workflows\fP .IP \(bu 2 \fI\%Coverage assessment\fP .UNINDENT .sp This is a teaching\-oriented example of using bcbio from the Cold Spring Harbor Laboratory\(aqs \fI\%Advanced Sequencing Technology and Applications course\fP\&. It uses cancer tumor/normal data from the \fI\%ICGC\-TCGA DREAM synthetic 3 challenge\fP, subset to exome regions on chromosome 6 to reduce runtimes. It demonstrates: .INDENT 0.0 .IP \(bu 2 Running a cancer tumor/normal workflow through bcbio. .IP \(bu 2 Analysis with human genome build 38. .IP \(bu 2 SNP and indel detection, with 3 variant callers and an ensemble method.
.IP \(bu 2 Structural variant calling, with 2 callers. .IP \(bu 2 Prioritization of structural variants for cancer\-associated genes in \fI\%CIViC\fP\&. .IP \(bu 2 HLA typing. .IP \(bu 2 Validation of both small and structural variants against truth sets. .UNINDENT .SH LOADING PRE-RUN ANALYSIS .sp To avoid downloading the genome data and running the analysis yourself, we provide a pre\-prepared AMI with the data and a completed analysis run. Use the \fI\%AWS Console\fP to launch the pre\-built AMI \-\- search Community AMIs for ami\-5e84fe34. Any small instance type is fine for exploring the configuration, run directory and output files. Make sure you associate a public IP and a security group that allows remote ssh. .sp Once launched, ssh into the remote machine with \fBssh \-i your\-keypair ubuntu@public.ip.address\fP to explore the inputs and outputs. The default PATH contains bcbio and third party programs in \fB/usr/local/bin\fP, with the biological data installed in \fB/usr/local/share/bcbio\fP\&. The run is in \fB~/run/cancer\-syn3\-chr6\fP\&. .SH INPUT CONFIGURATION FILE .sp To run bcbio, you prepare a small configuration file describing your analysis. You can \fI\%prepare it manually or use an automated configuration method\fP\&. The example has a pre\-written configuration file with tumor/normal data located in the \fBconfig\fP directory; this section walks through the settings. .sp You define the type of analysis (variant calling) along with the input files and genome build: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
analysis: variant2
files: [../input/cancer\-syn3\-chr6\-tumor\-1.fq.gz, ../input/cancer\-syn3\-chr6\-tumor\-2.fq.gz]
genome_build: hg38
.ft P .fi .UNINDENT .UNINDENT .sp Sample description and assignment as a tumor sample, called together with a matched normal: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
description: syn3\-tumor
metadata:
  batch: syn3
  phenotype: tumor
  sex: female
.ft P .fi .UNINDENT .UNINDENT .sp Next it defines parameters for running the analysis. First we pick our aligner (bwa mem): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
algorithm:
  aligner: bwa
.ft P .fi .UNINDENT .UNINDENT .sp Post\-alignment, we mark duplicates but do not perform recalibration and realignment: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
mark_duplicates: true
recalibrate: false
realign: false
.ft P .fi .UNINDENT .UNINDENT .sp We call variants in exome regions on chromosome 6 using a BED file input, call variants down to a 2% allele frequency in the tumor sample, and use 3 variant callers along with an ensemble method that keeps any variant found by at least 2 of the 3: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
variant_regions: ../input/NGv3\-chr6\-hg38.bed
min_allele_fraction: 2
variantcaller: [vardict, freebayes, varscan]
ensemble:
  numpass: 2
.ft P .fi .UNINDENT .UNINDENT .sp For structural variant calling, we use two callers and prioritize variants to those found in the CIViC database: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
svcaller: [lumpy, manta]
svprioritize: cancer/civic
.ft P .fi .UNINDENT .UNINDENT .sp Call HLA types with OptiType: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
hlacaller: optitype
.ft P .fi .UNINDENT .UNINDENT .sp Finally, we validate both the small variants and structural variants. These use pre\-installed validation sets that come with bcbio. We limit validation regions to avoid low complexity regions, which cause bias in \fI\%validating indels\fP (\fI\%http://bcb.io/2014/05/12/wgs\-trio\-variant\-evaluation/\fP): .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
exclude_regions: [lcr]
validate: dream\-syn3\-crossmap/truth_small_variants.vcf.gz
validate_regions: dream\-syn3\-crossmap/truth_regions.bed
svvalidate:
  DEL: dream\-syn3\-crossmap/truth_DEL.bed
  DUP: dream\-syn3\-crossmap/truth_DUP.bed
  INV: dream\-syn3\-crossmap/truth_INV.bed
.ft P .fi .UNINDENT .UNINDENT .sp The sketch below shows how these configuration fragments nest in the full sample YAML.
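.sp As a hedged illustration of the nesting, this Python snippet assembles the fragments above into one document and loads it with PyYAML. It assumes the standard bcbio layout with a top\-level \fBdetails\fP list of samples; the validation keys are omitted for brevity and the snippet is not part of bcbio itself: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
# Requires PyYAML (pip install pyyaml)
import yaml

config_text = """
details:
  \- analysis: variant2
    description: syn3\-tumor
    files: [../input/cancer\-syn3\-chr6\-tumor\-1.fq.gz,
            ../input/cancer\-syn3\-chr6\-tumor\-2.fq.gz]
    genome_build: hg38
    metadata:
      batch: syn3
      phenotype: tumor
      sex: female
    algorithm:
      aligner: bwa
      mark_duplicates: true
      recalibrate: false
      realign: false
      variant_regions: ../input/NGv3\-chr6\-hg38.bed
      min_allele_fraction: 2
      variantcaller: [vardict, freebayes, varscan]
      ensemble:
        numpass: 2
      svcaller: [lumpy, manta]
      svprioritize: cancer/civic
      hlacaller: optitype
"""

# Pull out the single sample entry and inspect a nested value
sample = yaml.safe_load(config_text)["details"][0]
print(sample["algorithm"]["variantcaller"])
# \-> [\(aqvardict\(aq, \(aqfreebayes\(aq, \(aqvarscan\(aq]
.ft P .fi .UNINDENT .UNINDENT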
.SH OUTPUT FILES .sp Output files are in \fB~/run/cancer\-syn3\-chr6/final\fP, extracted from the full work directory in \fB~/run/cancer\-syn3\-chr6/work\fP\&. .sp The directories with sample information are in \fBsyn3\-tumor/\fP\&. Aligned BAMs include a \fB\-ready.bam\fP file with all of the original reads (including split and discordant reads) and separate files with only the split (\fB\-sr.bam\fP) and discordant (\fB\-disc.bam\fP) reads: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
syn3\-tumor\-ready.bam
syn3\-tumor\-ready.bam.bai
syn3\-tumor\-sr.bam
syn3\-tumor\-sr.bam.bai
syn3\-tumor\-disc.bam
syn3\-tumor\-disc.bam.bai
.ft P .fi .UNINDENT .UNINDENT .sp SNP and indel calls for the 3 callers, plus combined ensemble calls: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
syn3\-tumor\-ensemble.vcf.gz
syn3\-tumor\-ensemble.vcf.gz.tbi
syn3\-tumor\-freebayes.vcf.gz
syn3\-tumor\-freebayes.vcf.gz.tbi
syn3\-tumor\-varscan.vcf.gz
syn3\-tumor\-varscan.vcf.gz.tbi
syn3\-tumor\-vardict.vcf.gz
syn3\-tumor\-vardict.vcf.gz.tbi
.ft P .fi .UNINDENT .UNINDENT .sp Structural variant calls for the 2 callers, plus a simplified list of structural variants in cancer genes of interest: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
syn3\-tumor\-sv\-prioritize.tsv
syn3\-tumor\-lumpy.vcf.gz
syn3\-tumor\-lumpy.vcf.gz.tbi
syn3\-tumor\-manta.vcf.gz
syn3\-tumor\-manta.vcf.gz.tbi
.ft P .fi .UNINDENT .UNINDENT .sp HLA typing results: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
syn3\-tumor\-hla\-optitype.csv
.ft P .fi .UNINDENT .UNINDENT .sp Validation results from comparisons against the truth sets, including plots: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
syn3\-tumor\-sv\-validate.csv
syn3\-tumor\-sv\-validate\-DEL.png
syn3\-tumor\-sv\-validate\-df.csv
syn3\-tumor\-sv\-validate\-DUP.png
syn3\-tumor\-sv\-validate\-INV.png
syn3\-tumor\-validate.png
.ft P .fi .UNINDENT .UNINDENT .sp The top level directory for the project, \fB2015\-11\-18_syn3\-cshl/\fP, has files relevant to the entire run. There is a consolidated quality control report: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
multiqc/multiqc_report.html
.ft P .fi .UNINDENT .UNINDENT .sp Provenance information, with log files of all commands run and program versions used: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
bcbio\-nextgen.log
bcbio\-nextgen\-commands.log
programs.txt
data_versions.csv
.ft P .fi .UNINDENT .UNINDENT .sp A top level summary of metrics for alignment, variant calling and coverage that is useful downstream: .INDENT 0.0 .INDENT 3.5 .sp .nf .ft C
project\-summary.yaml
.ft P .fi .UNINDENT .UNINDENT .SH PREPARING AND RUNNING .sp These steps prepare an AMI from a bare machine and run the analysis; they are pre\-done on the teaching AMI to save time: .INDENT 0.0 .IP 1. 3 Use the \fI\%AWS Console\fP to launch an Ubuntu Server 14.04 (ami\-d05e75b8). Start an m4.4xlarge instance with a 100GB SSD. Make sure you associate a public IP and can ssh in externally.
.IP 2. 3 SSH to your instance: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
ssh \-i ~/.ec2/your\-key.pem ubuntu@public\-ip
.ft P .fi .UNINDENT .UNINDENT .IP 3. 3 Install bcbio with hg38 data: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
sudo apt\-get update
sudo apt\-get install \-y build\-essential zlib1g\-dev wget curl python\-setuptools git \e
    openjdk\-7\-jdk openjdk\-7\-jre ruby libncurses5\-dev libcurl4\-openssl\-dev \e
    libbz2\-dev unzip pigz bsdmainutils
wget https://raw.githubusercontent.com/bcbio/bcbio\-nextgen/master/scripts/bcbio_nextgen_install.py
python bcbio_nextgen_install.py /usr/local/share/bcbio \-\-tooldir /usr/local \e
    \-\-genomes hg38 \-\-aligners bwa \-\-sudo \-\-isolate \-u development
.ft P .fi .UNINDENT .UNINDENT .IP 4. 3 Install the analysis data: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
mkdir \-p run
cd run
wget https://raw.githubusercontent.com/bcbio/bcbio\-nextgen/master/config/teaching/cancer\-syn3\-chr6\-prep.sh
bash cancer\-syn3\-chr6\-prep.sh
.ft P .fi .UNINDENT .UNINDENT .IP 5. 3 Run the analysis: .INDENT 3.0 .INDENT 3.5 .sp .nf .ft C
cd cancer\-syn3\-chr6/work
bcbio_nextgen.py ../config/cancer\-syn3\-chr6.yaml \-n 16
.ft P .fi .UNINDENT .UNINDENT .UNINDENT .sp \fI\%https://github.com/bcbio/bcbio\-nextgen\fP .SH SMALL RNA-SEQ .sp Data was analyzed with bcbio\-nextgen (\fI\%https://github.com/bcbio/bcbio\-nextgen\fP) using DNApi to detect the adapter, cutadapt to remove it, STAR/bowtie to align against the genome and seqcluster to detect small RNA transcripts. miRNAs were detected using the miraligner tool with miRBase as the reference miRNA database. tRNA profiles were detected using the tDRmapper tool. miRDeep2 was used for discovery of novel miRNAs. FastQC was used for QC metrics and MultiQC for reporting. .sp Download BIB format: \fI\%https://github.com/bcbio/bcbio\-nextgen/tree/master/docs/contents/misc/bcbio\-smallrna.bib\fP .SS Tools .INDENT 0.0 .IP \(bu 2 Tsuji, J., & Weng, Z. (2016). DNApi: A de novo adapter prediction algorithm for small RNA sequencing data. PLoS ONE, 11(10), e0164228. \fI\%http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0164228\fP .IP \(bu 2 Andrews, S. (2010). FastQC: A quality control tool for high throughput sequence data. .IP \(bu 2 Didion, J. P., Martin, M., & Collins, F. S. (2017). Atropos: specific, sensitive, and speedy trimming of sequencing reads. \fI\%http://doi.org/10.7287/peerj.preprints.2452v4\fP .IP \(bu 2 Dale, R. K., Pedersen, B. S., & Quinlan, A. R. (2011). Pybedtools: A flexible Python library for manipulating genomic datasets and annotations. Bioinformatics, 27(24), 3423–3424. doi:10.1093/bioinformatics/btr539 .IP \(bu 2 Quinlan, A. R., & Hall, I. M. (2010). BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6), 841–842. doi:10.1093/bioinformatics/btq033 .IP \(bu 2 Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J., & Prins, P. (2015). Sambamba: Fast processing of NGS alignment formats. Bioinformatics, 31(12), 2032–2034. doi:10.1093/bioinformatics/btv098 .IP \(bu 2 Heger, A. (2009). Pysam. Retrieved from \fI\%https://github.com/pysam\-developers/pysam\fP .IP \(bu 2 Li, H. (2011). A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics, 27(21), 2987–2993. doi:10.1093/bioinformatics/btr509 .IP \(bu 2 Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., … Durbin, R. (2009).
The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25(16), 2078–2079. doi:10.1093/bioinformatics/btp352 .IP \(bu 2 Pantano, L., Estivill, X., & Martí, E. (2010). SeqBuster, a bioinformatic tool for the processing and analysis of small RNAs datasets, reveals ubiquitous miRNA modifications in human embryonic cells. Nucleic Acids Research, 38(5), e34. Retrieved from \fI\%http://www.ncbi.nlm.nih.gov/pubmed/20008100\fP .IP \(bu 2 Pantano, L., Friedlander, M. R., Escaramis, G., Lizano, E., Pallares\-Albanell, J., Ferrer, I., … Marti, E. (2015). Specific small\-RNA signatures in the amygdala at premotor and motor stages of Parkinson’s disease revealed by deep sequencing analysis. Bioinformatics (Oxford, England). doi:10.1093/bioinformatics/btv632 .UNINDENT .sp For the alignment step, cite whichever aligner you used: .INDENT 0.0 .IP \(bu 2 Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., … Gingeras, T. R. (2013). STAR: Ultrafast universal RNA\-seq aligner. Bioinformatics, 29(1), 15–21. doi:10.1093/bioinformatics/bts635 .IP \(bu 2 Langmead, B., Trapnell, C., Pop, M., & Salzberg, S. L. (2009). Ultrafast and memory\-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10, R25. doi:10.1186/gb\-2009\-10\-3\-r25 .IP \(bu 2 Kim, D., Langmead, B., & Salzberg, S. L. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature Methods, 12(4), 357–360. doi:10.1038/nmeth.3317 .UNINDENT .sp If you used TopHat2 for alignment: .INDENT 0.0 .IP \(bu 2 Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013). TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4), R36. doi:10.1186/gb\-2013\-14\-4\-r36 .IP \(bu 2 Brueffer, C., & Saal, L. H. (2016). TopHat\-Recondition: A post\-processor for TopHat unmapped reads. BMC Bioinformatics, 17(1), 199. doi:10.1186/s12859\-016\-1058\-x .UNINDENT .sp If your output includes novel miRNA discovery, add: .INDENT 0.0 .IP \(bu 2 Friedlander, M. R., MacKowiak, S. D., Li, N., Chen, W., & Rajewsky, N. (2012). MiRDeep2 accurately identifies known and hundreds of novel microRNA genes in seven animal clades. Nucleic Acids Research, 40(1), 37–52. doi:10.1093/nar/gkr688 .UNINDENT .sp If you have tRNA mapping output, add: .INDENT 0.0 .IP \(bu 2 Selitsky, S. R., & Sethupathy, P. (2015). tDRmapper: challenges and solutions to mapping, naming, and quantifying tRNA\-derived RNAs from human small RNA\-sequencing data. BMC Bioinformatics, 16(1), 354. doi:10.1186/s12859\-015\-0800\-0 .UNINDENT .sp If you ran with miRge activated, add: .INDENT 0.0 .IP \(bu 2 Lu, Y., Baras, A. S., & Halushka, M. K. miRge2.0: An updated tool to comprehensively analyze microRNA sequencing data. bioRxiv. .UNINDENT .sp If you ran with MINTmap activated, add: .INDENT 0.0 .IP \(bu 2 Loher, P., Telonis, A. G., & Rigoutsos, I. (2017). MINTmap: fast and exhaustive profiling of nuclear and mitochondrial tRNA fragments from short RNA\-seq data. Scientific Reports, 7, 41184. doi:10.1038/srep41184. PMID: 28220888; PMC: PMC5318995. .UNINDENT .SS Data .INDENT 0.0 .IP \(bu 2 Griffiths\-Jones, S. (2004). The microRNA Registry. Nucleic Acids Research, 32(Database issue), D109–11. doi:10.1093/nar/gkh023 .IP \(bu 2 Griffiths\-Jones, S. (2006). miRBase: the microRNA sequence database. Methods in Molecular Biology (Clifton, N.J.), 342, 129–38. doi:10.1385/1\-59745\-123\-1:129 .IP \(bu 2 Griffiths\-Jones, S., Saini, H. K., Van Dongen, S., & Enright, A. J. (2008).
miRBase: Tools for microRNA genomics. Nucleic Acids Research, 36(SUPPL. 1). doi:10.1093/nar/gkm952 .IP \(bu 2 Kozomara, A., & Griffiths\-Jones, S. (2011). miRBase: Integrating microRNA annotation and deep\-sequencing data. Nucleic Acids Research, 39(SUPPL. 1). doi:10.1093/nar/gkq1027 .IP \(bu 2 Kozomara, A., & Griffiths\-Jones, S. (2014). miRBase: Annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Research, 42(D1). doi:10.1093/nar/gkt1181 .UNINDENT .SH AUTHOR Brad Chapman .SH COPYRIGHT 2019, bcbio-nextgen contributors .\" Generated by docutils manpage writer. .