BCFTOOLS(1) | BCFTOOLS(1) |
NAME¶
bcftools - utilities for variant calling and manipulating VCFs and BCFs.
SYNOPSIS¶
bcftools [--version|--version-only] [--help] [COMMAND] [OPTIONS]
DESCRIPTION¶
BCFtools is a set of utilities that manipulate variant calls in the Variant Call Format (VCF) and its binary counterpart BCF. All commands work transparently with both VCFs and BCFs, both uncompressed and BGZF-compressed.
Most commands accept VCF, bgzipped VCF and BCF with filetype detected automatically even when streaming from a pipe. Indexed VCF and BCF will work in all situations. Un-indexed VCF and BCF and streams will work in most, but not all situations. In general, whenever multiple VCFs are read simultaneously, they must be indexed and therefore also compressed. (Note that files with non-standard index names can be accessed as e.g. "bcftools view -r X:2928329 file.vcf.gz##idx##non-standard-index-name".)
BCFtools is designed to work on a stream. It regards an input file "-" as the standard input (stdin) and outputs to the standard output (stdout). Several commands can thus be combined with Unix pipes.
VERSION¶
This manual page was last updated 2024-04-15 and refers to bcftools git version 1.20.
BCF1¶
The obsolete BCF1 format output by versions of samtools <= 0.1.19 is not compatible with this version of bcftools. To read BCF1 files one can use the view command from old versions of bcftools packaged with samtools versions <= 0.1.19 to convert to VCF, which can then be read by this version of bcftools.
samtools-0.1.19/bcftools/bcftools view file.bcf1 | bcftools view
VARIANT CALLING¶
See bcftools call for variant calling from the output of the samtools mpileup command. In versions of samtools <= 0.1.19 calling was done with bcftools view. Users are now required to choose between the old samtools calling model (-c/--consensus-caller) and the new multiallelic calling model (-m/--multiallelic-caller). The multiallelic calling model is recommended for most tasks.
FILTERING EXPRESSIONS¶
See EXPRESSIONS
LIST OF COMMANDS¶
For a full list of available commands, run bcftools without arguments. For a full list of available options, run bcftools COMMAND without arguments.
LIST OF SCRIPTS¶
Some helper scripts are bundled with the bcftools code.
COMMANDS AND OPTIONS¶
Common Options¶
The following options are common to many bcftools commands. See usage for specific commands to see if they apply.
FILE
-c, --collapse snps|indels|both|all|some|none|id
none
some
all
snps
indels
both
id
-f, --apply-filters LIST
--no-version
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
-r, --regions chr|chr:pos|chr:beg-end|chr:beg-[,...]
-R, --regions-file FILE
--regions-overlap pos|record|variant|0|1|2
-s, --samples [^]LIST
bcftools view -Ou -s sample1,sample2 file.vcf | bcftools query -f %INFO/AC\t%INFO/AN\n
-S, --samples-file [^]FILE
sample1 1
sample2 2
sample3 2
or
sample1 M
sample2 F
sample3 F
If the second column is not present, the sex "F" is assumed. With bcftools call -C trio, PED file is expected. The program ignores the first column and the last indicates sex (1=male, 2=female), for example:
ignored_column daughterA fatherA motherA 2
ignored_column sonB fatherB motherB 1
-t, --targets [^]chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file [^]FILE
With the call -C alleles command, third column of the targets file must be comma-separated list of alleles, starting with the reference allele. Note that the file must be compressed and indexed. Such a file can be easily created from a VCF using:
bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' file.vcf | bgzip -c > als.tsv.gz && tabix -s1 -b2 -e2 als.tsv.gz
--targets-overlap pos|record|variant|0|1|2
--threads INT
-W[FMT], -W[=FMT], --write-index[=FMT]
bcftools annotate [OPTIONS] FILE¶
Add or remove annotations.
-a, --annotations file
When multiple ALT alleles are present in the annotation file (given as comma-separated list of alleles), at least one must match one of the alleles in the corresponding VCF record. Similarly, at least one alternate allele from a multi-allelic VCF record must be present in the annotation file.
Missing values can be added by providing "." in place of actual value and using the missing value modifier with -c, such as ".TAG".
Note that flag types, such as "INFO/FLAG", can be annotated by including a field with the value "1" to set the flag, "0" to remove it, or "." to keep existing flags. See also -c, --columns and -h, --header-lines.
# Sample annotation file with columns CHROM, POS, STRING_TAG, NUMERIC_TAG
1 752566 SomeString 5
1 798959 SomeOtherString 6
-c, --columns list
If the annotation file is a VCF/BCF, only the edited columns/tags must be present and their order does not matter. The columns ID, QUAL, FILTER, INFO and FORMAT can be edited, where INFO tags can be written both as "INFO/TAG" or simply "TAG", and FORMAT tags can be written as "FORMAT/TAG" or "FMT/TAG". The imported VCF annotations can be renamed as "DST_TAG:=SRC_TAG" or "FMT/DST_TAG:=FMT/SRC_TAG".
To carry over all INFO annotations, use "INFO". To add all INFO annotations except "TAG", use "^INFO/TAG". By default, existing values are replaced.
By default, existing tags are overwritten unless the source value is a missing value (i.e. "."). If also missing values should be carried over (and overwrite existing tags), use ".TAG" instead of "TAG". To add annotations without overwriting existing values (that is, to add tags that are absent or to add values to existing tags with missing values), use "+TAG" instead of "TAG". These can be combined, for example ".+TAG" can be used to add TAG even if the source value is missing but only if TAG does not exist in the target file; existing tags will not be overwritten. To append to existing values (rather than replacing or leaving untouched), use "=TAG" (instead of "TAG" or "+TAG"). To replace only existing values without modifying missing annotations, use "-TAG". To match the record also by ID or INFO/END, in addition to REF and ALT, use "~ID" or "~INFO/END". If position needs to be replaced, mark the column with the new position as "~POS".
If the annotation file is not a VCF/BCF, all new annotations must be defined via -h, --header-lines.
See also the -l, --merge-logic option.
-C, --columns-file file
-e, --exclude EXPRESSION
--force
-h, --header-lines file
##INFO=<ID=NUMERIC_TAG,Number=1,Type=Integer,Description="Example header line">
##INFO=<ID=STRING_TAG,Number=1,Type=String,Description="Yet another header line">
-I, --set-id [+]FORMAT
bcftools annotate --set-id +'%CHROM\_%POS\_%REF\_%FIRST_ALT' file.vcf
-i, --include EXPRESSION
-k, --keep-sites
-l, --merge-logic tag:first|append|append-missing|unique|sum|avg|min|max[,...]
-m, --mark-sites TAG
--min-overlap ANN:'VCF'
--no-version
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
--pair-logic snps|indels|both|all|some|exact
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
--rename-annots file
--rename-chrs file
-s, --samples [^]LIST
-S, --samples-file FILE
--single-overlaps
--threads INT
-x, --remove list
-W[FMT], -W[=FMT], --write-index[=FMT]
Examples:
# Remove three fields
bcftools annotate -x ID,INFO/DP,FORMAT/DP file.vcf.gz
# Remove all INFO fields and all FORMAT fields except for GT and PL
bcftools annotate -x INFO,^FORMAT/GT,FORMAT/PL file.vcf
# Add ID, QUAL and INFO/TAG, not replacing TAG if already present
bcftools annotate -a src.bcf -c ID,QUAL,+TAG dst.bcf
# Carry over all INFO and FORMAT annotations except FORMAT/GT
bcftools annotate -a src.bcf -c INFO,^FORMAT/GT dst.bcf
# Annotate from a tab-delimited file with six columns (the fifth is ignored),
# first indexing with tabix. The coordinates are 1-based.
tabix -s1 -b2 -e2 annots.tab.gz
bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,POS,REF,ALT,-,TAG file.vcf
# Annotate from a tab-delimited file with regions (1-based coordinates, inclusive)
tabix -s1 -b2 -e3 annots.tab.gz
bcftools annotate -a annots.tab.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
# Annotate from a bed file (0-based coordinates, half-closed, half-open intervals)
bcftools annotate -a annots.bed.gz -h annots.hdr -c CHROM,FROM,TO,TAG input.vcf
# Transfer the INFO/END tag, matching by POS,REF,ALT and ID. This example assumes
# that INFO/END is already present in the VCF header.
bcftools annotate -a annots.tab.gz -c CHROM,POS,~ID,REF,ALT,INFO/END input.vcf
# For (many) more examples see http://samtools.github.io/bcftools/howtos/annotate.html
bcftools call [OPTIONS] FILE¶
This command replaces the former bcftools view caller. Some of the original functionality has been temporarily lost in the process of transition under htslib <http://github.com/samtools/htslib>, but will be added back on popular demand. The original calling model can be invoked with the -c option.
File format options:¶
--no-version
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
--ploidy ASSEMBLY[?]
--ploidy-file FILE
X 1 60000 M 1
X 2699521 154931043 M 1
Y 1 59373566 M 1
Y 1 59373566 F 0
MT 1 16569 M 1
MT 1 16569 F 1
* * * M 2
* * * F 2
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-s, --samples LIST
-S, --samples-file FILE
-t, --targets LIST
-T, --targets-file FILE
--targets-overlap 0|1|2
--threads INT
-W[FMT], -W[=FMT], --write-index[=FMT]
Input/output options:¶
-A, --keep-alts
-*, --keep-unseen-allele
-f, --format-fields list
-F, --prior-freqs AN,AC
# Extract AN,AC values from an existing VCF, such 1000Genomes
bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' 1000Genomes.bcf | bgzip -c > AFs.tab.gz
# If the tags AN,AC are not already present, use the +fill-tags plugin
bcftools +fill-tags 1000Genomes.bcf | bcftools query -f'%CHROM\t%POS\t%REF\t%ALT\t%AN\t%AC\n' | bgzip -c > AFs.tab.gz
tabix -s1 -b2 -e2 AFs.tab.gz
# Create a VCF header description, here we name the tags REF_AN,REF_AC
cat AFs.hdr
##INFO=<ID=REF_AN,Number=1,Type=Integer,Description="Total number of alleles in reference genotypes">
##INFO=<ID=REF_AC,Number=A,Type=Integer,Description="Allele count in reference genotypes for each ALT allele">
# Now before calling, stream the raw mpileup output through `bcftools annotate` to add the frequencies
bcftools mpileup [...] -Ou | bcftools annotate -a AFs.tab.gz -h AFs.hdr -c CHROM,POS,REF,ALT,REF_AN,REF_AC -Ou | bcftools call -mv -F REF_AN,REF_AC [...]
-G, --group-samples FILE|-
-g, --gvcf INT
-i, --insert-missed INT
-M, --keep-masked-ref
-V, --skip-variants snps|indels
-v, --variants-only
Consensus/variant calling options:¶
-c, --consensus-caller
-C, --constrain alleles|trio
alleles
trio
-m, --multiallelic-caller
-n, --novel-rate float[,...]
-p, --pval-threshold float
-P, --prior float
-t, --targets file|chr|chr:pos|chr:from-to|chr:from-[,...]
-X, --chromosome-X
-Y, --chromosome-Y
bcftools cnv [OPTIONS] FILE¶
Copy number variation caller, requires a VCF annotated with the Illumina’s B-allele frequency (BAF) and Log R Ratio intensity (LRR) values. The HMM considers the following copy number states: CN 2 (normal), 1 (single-copy loss), 0 (complete loss), 3 (single-copy gain).
General Options:¶
-c, --control-sample string
-f, --AF-file file
-o, --output-dir path
-p, --plot-threshold float
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-s, --query-sample string
-t, --targets LIST
-T, --targets-file FILE
--targets-overlap 0|1|2
HMM Options:¶
-a, --aberrant float[,float]
-b, --BAF-weight float
-d, --BAF-dev float[,float]
-e, --err-prob float
-l, --LRR-weight float
-L, --LRR-smooth-win int
-O, --optimize float
-P, --same-prob float
-x, --xy-prob float
bcftools concat [OPTIONS] FILE1 FILE2 [...]¶
Concatenate or combine VCF/BCF files. All source files must have the same sample columns appearing in the same order. Can be used, for example, to concatenate chromosome VCFs into one VCF, or combine a SNP VCF and an indel VCF into one. The input files must be sorted by chr and position. The files must be given in the correct order to produce sorted VCF on output unless the -a, --allow-overlaps option is specified. With the --naive option, the files are concatenated without being recompressed, which is very fast..
-a, --allow-overlaps
-c, --compact-PS
-d, --rm-dups snps|indels|both|all|exact
In other words, the default behavior of the program is similar to unix "cat" in that when two files contain a record with the same position, that position will appear twice on output. With -d, every line that finds a matching record in another file will be printed only once.
Requires -a, --allow-overlaps.
-D, --remove-duplicates
-f, --file-list FILE
-l, --ligate
--ligate-force
--ligate-warn
--no-version
-n, --naive
--naive-force
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
-q, --min-PQ INT
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file FILE
--regions-overlap 0|1|2
--threads INT
-W[FMT], -W[=FMT], --write-index[=FMT]
bcftools consensus [OPTIONS] FILE¶
Create consensus sequence by applying VCF variants to a reference fasta file. By default, the program will apply all ALT variants to the reference fasta to obtain the consensus sequence. Using the --sample (and, optionally, --haplotype) option will apply genotype (haplotype) calls from FORMAT/GT. Note that the program does not act as a primitive variant caller and ignores allelic depth information, such as INFO/AD or FORMAT/AD. For that, consider using the setGT plugin.
-a, --absent CHAR
-c, --chain FILE
-e, --exclude EXPRESSION
-f, --fasta-ref FILE
-H, --haplotype N|R|A|I|LR|LA|SR|SA|NpIu
N
R
A
I
LR, LA
SR, SA
NpIu
Note that the -H, --haplotype option requires the -s, --samples option, unless exactly one sample is present in the VCF
-i, --include EXPRESSION
-I, --iupac-codes
--mark-del CHAR
--mark-ins uc|lc|CHAR
--mark-snv uc|lc
-m, --mask FILE
--mask-with CHAR|lc|uc
-M, --missing CHAR
-o, --output FILE
--regions-overlap 0|1|2
-s, --samples LIST
-S, --samples-file FILE
Examples:
# Apply variants present in sample "NA001", output IUPAC codes for hets
bcftools consensus -i -s NA001 -f in.fa in.vcf.gz > out.fa
# Create consensus for one region. The fasta header lines are then expected
# in the form ">chr:from-to". Ignore samples and consider only the REF and ALT columns
samtools faidx ref.fa 8:11870-11890 | bcftools consensus -s - in.vcf.gz -o out.fa
# For more examples see http://samtools.github.io/bcftools/howtos/consensus-sequence.html
Notes:
bcftools convert [OPTIONS] FILE¶
VCF input options:¶
-e, --exclude EXPRESSION
-i, --include EXPRESSION
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file FILE
--regions-overlap 0|1|2
-s, --samples LIST
-S, --samples-file FILE
-t, --targets LIST
-T, --targets-file FILE
--targets-overlap 0|1|2
-W[FMT], -W[=FMT], --write-index[=FMT]
VCF output options:¶
--no-version
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
--threads INT
GEN/SAMPLE conversion:¶
-G, --gensample2vcf prefix or gen-file,sample-file
When the --vcf-ids option is given, the other column (autodetected) is used to fill the ID column of the VCF.
See also -g and --3N6 options.
-g, --gensample prefix or gen-file,sample-file
See also -G and --3N6 options.
The file .gen and .sample file format are:
.gen (with --3N6 --vcf-ids)
---------------------------
chr1 1:111485207_G_A rsID1 111485207 G A 0 1 0 0 1 0
chr1 1:111494194_C_T rsID2 111494194 C T 0 1 0 0 0 1
.gen (with --vcf-ids)
---------------------------
1:111485207_G_A rsID1 111485207 G A 0 1 0 0 1 0
1:111494194_C_T rsID2 111494194 C T 0 1 0 0 0 1
.gen (the default)
------------------------------
1:111485207_G_A 1:111485207_G_A 111485207 G A 0 1 0 0 1 0
1:111494194_C_T 1:111494194_C_T 111494194 C T 0 1 0 0 0 1
.sample
-------
ID_1 ID_2 missing
0 0 0
sample1 sample1 0
sample2 sample2 0
--3N6
--tag STRING
--sex FILE
MaleSample M
FemaleSample F
--vcf-ids
gVCF conversion:¶
--gvcf2vcf
-f, --fasta-ref file
HAP/SAMPLE conversion:¶
--hapsample2vcf prefix or hap-file,sample-file
.hap (with --vcf-ids)
---------------------
1:111485207_G_A rsID1 111485207 G A 0 1 0 0
1:111495231_A_<DEL>_111495784 rsID3 111495231 A <DEL> 0 0 1 0
.hap (the default)
------------------
1 1:111485207_G_A 111485207 G A 0 1 0 0
1 1:111495231_A_<DEL>_111495784 111495231 A <DEL> 0 0 1 0
--hapsample prefix or hap-file,sample-file
--haploid2diploid
--sex FILE
MaleSample M
FemaleSample F
--vcf-ids
HAP/LEGEND/SAMPLE conversion:¶
-H, --haplegendsample2vcf prefix or hap-file,legend-file,sample-file
-h, --haplegendsample prefix or hap-file,legend-file,sample-file
.hap
-----
0 1 0 0 1 0
0 1 0 0 0 1
.legend
-------
id position a0 a1
1:111485207_G_A 111485207 G A
1:111494194_C_T 111494194 C T
.sample
-------
sample population group sex
sample1 sample1 sample1 2
sample2 sample2 sample2 2
--haploid2diploid
--sex FILE
MaleSample M
FemaleSample F
--vcf-ids
TSV conversion:¶
--tsv2vcf file
-c, --columns list
-f, --fasta-ref file
-s, --samples LIST
-S, --samples-file FILE
Example:
# Convert 23andme results into VCF bcftools convert -c ID,CHROM,POS,AA -s SampleName -f 23andme-ref.fa --tsv2vcf 23andme.txt -o out.vcf.gz # Convert tab-delimited file into a sites-only VCF (no genotypes), in this example first column to be ignored bcftools convert -c -,CHROM,POS,REF,ALT -f ref.fa --tsv2vcf calls.txt -o out.bcf
bcftools csq [OPTIONS] FILE¶
Haplotype aware consequence predictor which correctly handles combined variants such as MNPs split over multiple VCF records, SNPs separated by an intron (but adjacent in the spliced transcript) or nearby frame-shifting indels which in combination in fact are not frame-shifting.
The output VCF is annotated with INFO/BCSQ and FORMAT/BCSQ tag (configurable with the -c option). The latter is a bitmask of indexes to INFO/BCSQ, with interleaved haplotypes. See the usage examples below for using the %TBCSQ converter in query for extracting a more human readable form from this bitmask. The construction of the bitmask limits the number of consequences that can be referenced per sample in the FORMAT/BCSQ tags. By default this is 15, but if more are required, see the --ncsq option.
Note that the program annotates only records with a functional consequence and intergenic regions will pass through unchanged.
The program requires on input a VCF/BCF file, the reference genome in fasta format (--fasta-ref) and genomic features in the GFF3 format downloadable from the Ensembl website (--gff-annot), and outputs an annotated VCF/BCF file. Currently, only Ensembl GFF3 files are supported.
By default, the input VCF should be phased. If phase is unknown, or only partially known, the --phase option can be used to indicate how to handle unphased data. Alternatively, haplotype aware calling can be turned off with the --local-csq option.
If conflicting (overlapping) variants within one haplotype are detected, a warning will be emitted and predictions will be based on only the first variant in the analysis.
Symbolic alleles are not supported. They will remain unannotated in the output VCF and are ignored for the prediction analysis.
-c, --custom-tag STRING
-B, --trim-protein-seq INT
--dump-gff FILE
-e, --exclude EXPRESSION
-f, --fasta-ref FILE
--force
-g, --gff-annot FILE
# The program looks for "CDS", "exon", "three_prime_UTR" and "five_prime_UTR" lines,
# looks up their parent transcript (determined from the "Parent=transcript:" attribute),
# the gene (determined from the transcript's "Parent=gene:" attribute), and the biotype
# (the most interesting is "protein_coding").
#
# Empty and commented lines are skipped, the following GFF columns are required
# 1. chromosome
# 2. IGNORED
# 3. type (CDS, exon, three_prime_UTR, five_prime_UTR, gene, transcript, etc.)
# 4. start of the feature (1-based)
# 5. end of the feature (1-based)
# 6. IGNORED
# 7. strand (+ or -)
# 8. phase (0, 1, 2 or .)
# 9. semicolon-separated attributes (see below)
#
# Attributes required for
# gene lines:
# - ID=gene:<gene_id>
# - biotype=<biotype>
# - Name=<gene_name> [optional]
#
# transcript lines:
# - ID=transcript:<transcript_id>
# - Parent=gene:<gene_id>
# - biotype=<biotype>
#
# other lines (CDS, exon, five_prime_UTR, three_prime_UTR):
# - Parent=transcript:<transcript_id>
#
# Supported biotypes:
# - see the function gff_parse_biotype() in bcftools/csq.c
1 ignored_field gene 21 2148 . - . ID=gene:GeneId;biotype=protein_coding;Name=GeneName
1 ignored_field transcript 21 2148 . - . ID=transcript:TranscriptId;Parent=gene:GeneId;biotype=protein_coding
1 ignored_field three_prime_UTR 21 2054 . - . Parent=transcript:TranscriptId
1 ignored_field exon 21 2148 . - . Parent=transcript:TranscriptId
1 ignored_field CDS 21 2148 . - 1 Parent=transcript:TranscriptId
1 ignored_field five_prime_UTR 210 2148 . - . Parent=transcript:TranscriptId
-i, --include EXPRESSION
-l, --local-csq
-n, --ncsq INT
--no-version
-o, --output FILE
-O, --output-type t|b|u|z|v[0-9]
-p, --phase a|m|r|R|s
a
m
r
R
s
-q, --quiet
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file FILE
--regions-overlap 0|1|2
-s, --samples LIST
-S, --samples-file FILE
-t, --targets LIST
-T, --targets-file FILE
--targets-overlap 0|1|2
--unify-chr-names 0|1
-W[FMT], -W[=FMT], --write-index[=FMT]
Examples:
# Basic usage
bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf -Ob -o out.bcf
# Extract the translated haplotype consequences. The following TBCSQ variations
# are recognised:
# %TBCSQ .. print consequences in all haplotypes in separate columns
# %TBCSQ{0} .. print the first haplotype only
# %TBCSQ{1} .. print the second haplotype only
# %TBCSQ{*} .. print a list of unique consequences present in either haplotype
bcftools query -f'[%CHROM\t%POS\t%SAMPLE\t%TBCSQ\n]' out.bcf
Examples of BCSQ annotation:
# Two separate VCF records at positions 2:122106101 and 2:122106102
# change the same codon. This UV-induced C>T dinucleotide mutation
# has been annotated fully at the position 2:122106101 with
# - consequence type
# - gene name
# - ensembl transcript ID
# - coding strand (+ fwd, - rev)
# - amino acid position (in the coding strand orientation)
# - list of corresponding VCF variants
# The annotation at the second position gives the position of the full
# annotation
BCSQ=missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A
BCSQ=@122106101
# A frame-restoring combination of two frameshift insertions C>CG and T>TGG
BCSQ=@46115084
BCSQ=inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG
# Stop gained variant
BCSQ=stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T
# The consequence type of a variant downstream from a stop are prefixed with *
BCSQ=*missense|PER3|ENST00000361923|+|1028M>1028T|7890117T>C
Supported consequence types
3_prime_utr 5_prime_utr coding_sequence feature_elongation frameshift inframe_altering inframe_deletion inframe_insertion intergenic intron missense non_coding splice_acceptor splice_donor splice_region start_lost start_retained stop_gained stop_lost stop_retained synonymous
See also <https://ensembl.org/info/genome/variation/prediction/predicted_data.html>
bcftools filter [OPTIONS] FILE¶
Apply fixed-threshold filters.
-e, --exclude EXPRESSION
-g, --SnpGap INT[:'indel',mnp,bnd,other,overlap]
The SNPs at positions 1 and 7 are filtered, positions 0 and 8 are not:
0123456789
ref .G.GT..G..
del .A.G-..A.. Here the positions 1 and 6 are filtered, 0 and 7 are not:
0123-456789
ref .G.G-..G..
ins .A.GT..A..
-G, --IndelGap INT
The second indel is filtered:
012345678901
ref .GT.GT..GT..
del .G-.G-..G-.. And similarly here, the second is filtered:
01 23 456 78
ref .A-.A-..A-..
ins .AT.AT..AT..
-i, --include EXPRESSION
--mask [^]REGION
-M, --mask-file [^]FILE
--mask-overlap 0|1|2
-m, --mode [+x]
--no-version
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-s, --soft-filter STRING|+
-S, --set-GTs .|0
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
--targets-overlap 0|1|2
--threads INT
-W[FMT], -W[=FMT], --write-index[=FMT]
bcftools gtcheck [OPTIONS] [-g genotypes.vcf.gz] query.vcf.gz¶
Checks sample identity. The program can operate in two modes. If the -g option is given, the identity of samples from query.vcf.gz is checked against the samples in the -g file. Without the -g option, multi-sample cross-check of samples in query.vcf.gz is performed.
Note that the interpretation of the discordance score depends on the options provided (specifically -e and -u) and on the available annotations (FORMAT/PL vs FORMAT/GT). The discordance score can be interpreted as the number of mismatching genotypes if only GT-vs-GT matching is performed.
--distinctive-sites NUM[,MEM[,DIR]]
--dry-run
-e, --exclude [qry|gt]:'EXPRESSION'
-E, --error-probability INT
If -E is set to 0, the discordance score can be interpreted as the number of mismatching genotypes, but only in the GT-vs-GT matching mode. See the -u, --use option below for additional notes and caveats.
If performance is an issue, set -E 0 for faster run times but less accurate results.
Note that in previous versions of bcftools (⇐1.18), this option used to be a smaller case -e. It changed to make room for the filtering option -e, --exclude to stay consistent across other commands.
-g, --genotypes FILE
-H, --homs-only
-i, --include [qry|gt]:'EXPRESSION'
--n-matches INT
--no-HWE-prob
-o, --output FILE
-O, --output-type t|z
-p, --pairs LIST
-P, --pairs-file FILE
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
*-R, --regions-file' FILE
--regions-overlap 0|1|2
-s, --samples [qry|gt]:'LIST': List of query samples or -g samples. If neither -s nor -S are given, all possible sample pair combinations are compared
-S, --samples-file [qry|gt]:'FILE' File with the query or -g samples to compare. If neither -s nor -S are given, all possible sample pair combinations are compared
-t, --targets file
-T, --targets-file file
--targets-overlap 0|1|2
-u, --use TAG1[,TAG2]
Note that when the requested tag is not available, the program will attempt to use the other tag. The output includes the number of sites that were matched by the four possible modes (for example GT-vs-GT or GT-vs-PL).
Examples:
# Check discordance of all samples from B against all samples in A
bcftools gtcheck -g A.bcf B.bcf
# Limit comparisons to the given list of samples
bcftools gtcheck -s gt:a1,a2,a3 -s qry:b1,b2 -g A.bcf B.bcf
# Compare only two pairs a1,b1 and a1,b2
bcftools gtcheck -p a1,b1,a1,b2 -g A.bcf B.bcf
bcftools head [OPTIONS] [FILE]¶
By default, prints all headers from the specified input file to standard output in VCF format. The input file may be in VCF or BCF format; if no FILE is specified, standard input will be read. With appropriate options, only some of the headers and/or additionally some of the variant records will be printed.
The bcftools head command outputs VCF headers almost exactly as they appear in the input file: it may add a ##FILTER=<ID=PASS> header if not already present, but it never adds version or command line information itself.
Options:¶
-h, --header INT
-n, --records INT
-s, --samples INT
bcftools index [OPTIONS] in.bcf|in.vcf.gz¶
Creates index for bgzip compressed VCF/BCF files for random access. CSI (coordinate-sorted index) is created by default. The CSI format supports indexing of chromosomes up to length 2^31. TBI (tabix index) index files, which support chromosome lengths up to 2^29, can be created by using the -t/--tbi option or using the tabix program packaged with htslib. When loading an index file, bcftools will try the CSI first and then the TBI.
Indexing options:¶
-c, --csi
-f, --force
-m, --min-shift INT
-o, --output FILE
-t, --tbi
--threads INT
Stats options:¶
-a, --all
-n, --nrecords
-s, --stats
bcftools isec [OPTIONS] A.vcf.gz B.vcf.gz [...]¶
Creates intersections, unions and complements of VCF files. Depending on the options, the program can output records from one (or more) files which have (or do not have) corresponding records with the same position in the other files.
-c, --collapse snps|indels|both|all|some|none
-C, --complement
-e, --exclude -|EXPRESSION
-f, --apply-filters LIST
-i, --include EXPRESSION
-f, --file-list FILE
-n, --nfiles [+-=]INT|~BITMAP
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
-p, --prefix DIR
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
--targets-overlap 0|1|2
-w, --write LIST
-W[FMT], -W[=FMT], --write-index[=FMT]
Examples:¶
Create intersection and complements of two sets saving the output in dir/*
bcftools isec -p dir A.vcf.gz B.vcf.gz
Filter sites in A (require INFO/MAF>=0.01) and B (require INFO/dbSNP) but not in C, and create an intersection, including only sites which appear in at least two of the files after filters have been applied
bcftools isec -e'MAF<0.01' -i'dbSNP=1' -e- A.vcf.gz B.vcf.gz C.vcf.gz -n +2 -p dir
Extract and write records from A shared by both A and B using exact allele match
bcftools isec -p dir -n=2 -w1 A.vcf.gz B.vcf.gz
Extract records private to A or B comparing by position only
bcftools isec -p dir -n-1 -c all A.vcf.gz B.vcf.gz
Print a list of records which are present in A and B but not in C and D
bcftools isec -n~1100 -c all A.vcf.gz B.vcf.gz C.vcf.gz D.vcf.gz
bcftools merge [OPTIONS] A.vcf.gz B.vcf.gz [...]¶
Merge multiple VCF/BCF files from non-overlapping sample sets to create one multi-sample file. For example, when merging file A.vcf.gz containing samples S1, S2 and S3 and file B.vcf.gz containing samples S3 and S4, the output file will contain five samples named S1, S2, S3, 2:S3 and S4.
Note that it is responsibility of the user to ensure that the sample names are unique across all files. If they are not, the program will exit with an error unless the option --force-samples is given. The sample names can be also given explicitly using the --print-header and --use-header options.
Note that only records from different files can be merged, never from the same file. For "vertical" merge take a look at bcftools concat or bcftools norm -m instead.
--force-no-index
--force-samples
--force-single
--print-header
--use-header FILE
-0 --missing-to-ref
-f, --apply-filters LIST
-F, --filter-logic x|+
-g, --gvcf -|FILE
-i, --info-rules -|TAG:METHOD[,...]
-l, --file-list FILE
-L, --local-alleles INT
-m, --merge snps|indels|both|snp-ins-del|all|none|id[,*]
-m none .. no new multiallelics, output multiple records instead -m snps .. allow multiallelic SNP records -m indels .. allow multiallelic indel records -m both .. both SNP and indel records can be multiallelic -m both,* .. same as above but remove <*> (or <NON_REF>) from variant sites -m both,** .. same as above but remove <*> (or <NON_REF>) at all sites -m all .. SNP records can be merged with indel records -m snp-ins-del .. allow multiallelic SNVs, insertions, deletions, but don't mix them -m id .. merge by ID
-M, --missing-rules -|TAG:METHOD[,...]
--no-index
--no-version
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
--threads INT
-W[FMT], -W[=FMT], --write-index[=FMT]
bcftools mpileup [OPTIONS] -f ref.fa in.bam [in2.bam [...]]¶
Generate VCF or BCF containing genotype likelihoods for one or multiple alignment (BAM or CRAM) files. This is based on the original samtools mpileup command (with the -v or -g options) producing genotype likelihoods in VCF or BCF format, but not the textual pileup output. The mpileup command was transferred to bcftools in order to avoid errors resulting from use of incompatible versions of samtools and bcftools when using in the mpileup+bcftools call pipeline.
Individuals are identified from the SM tags in the @RG header lines. Multiple individuals can be pooled in one alignment file, also one individual can be separated into multiple files. If sample identifiers are absent, each input file is regarded as one sample.
Note that there are two orthogonal ways to specify locations in the input file; via -r region and -t positions. The former uses (and requires) an index to do random access while the latter streams through the file contents filtering out the specified regions, requiring no index. The two may be used in conjunction. For example a BED file containing locations of genes in chromosome 20 could be specified using -r 20 -t chr20.bed, meaning that the index is used to find chromosome 20 and then it is filtered for the regions listed in the BED file. Also note that the -r option can be much slower than -t with many regions and can require more memory when multiple regions and many alignment files are processed.
Input options¶
-6, --illumina1.3+
-A, --count-orphans
-b, --bam-list FILE
-B, --no-BAQ
-C, --adjust-MQ INT
-D, --full-BAQ
By default mpileup uses heuristics to decide when to apply the BAQ algorithm. Most sequences will not be BAQ adjusted, giving a CPU time closer to --no-BAQ, but it will still be applied in regions with suspected problematic alignments. This has been tested to work well on single sample data with even allele frequency, but the reliability is unknown for multi-sample calling and for low allele frequency variants so full BAQ is still recommended in those scenarios.
-d, --max-depth INT
-E, --redo-BAQ
-f, --fasta-ref FILE
--no-reference
-G, --read-groups FILE
RG_ID_1
RG_ID_2 SAMPLE_A
RG_ID_3 SAMPLE_A
RG_ID_4 SAMPLE_B
RG_ID_5 FILE_1.bam SAMPLE_A
RG_ID_6 FILE_2.bam SAMPLE_A
* FILE_3.bam SAMPLE_C
? FILE_3.bam SAMPLE_D
--indels-2.0
--indels-cns
--no-indels-cns
-q, -min-MQ INT
-Q, --min-BQ INT
--max-BQ INT
-r, --regions CHR|CHR:POS|CHR:FROM-TO|CHR:FROM-[,...]
-R, --regions-file FILE
--regions-overlap 0|1|2
--ignore-RG
--ls, --skip-all-set
--ns, --skip-any-set
--lu, --skip-all-unset
--nu, --skip-any-unset
-s, --samples LIST
-S, --samples-file FILE
-t, --targets LIST
-T, --targets-file FILE
--targets-overlap 0|1|2
-x, --ignore-overlaps
--seed INT
Output options¶
-a, --annotate LIST
FORMAT/AD .. Allelic depth (Number=R,Type=Integer) FORMAT/ADF .. Allelic depths on the forward strand (Number=R,Type=Integer) FORMAT/ADR .. Allelic depths on the reverse strand (Number=R,Type=Integer) FORMAT/DP .. Number of high-quality bases (Number=1,Type=Integer) FORMAT/SP .. Phred-scaled strand bias P-value (Number=1,Type=Integer) FORMAT/SCR .. Number of soft-clipped reads (Number=1,Type=Integer) INFO/AD .. Total allelic depth (Number=R,Type=Integer) INFO/ADF .. Total allelic depths on the forward strand (Number=R,Type=Integer) INFO/ADR .. Total allelic depths on the reverse strand (Number=R,Type=Integer) INFO/SCR .. Number of soft-clipped reads (Number=1,Type=Integer) FORMAT/DV .. Deprecated in favor of FORMAT/AD; Number of high-quality non-reference bases, (Number=1,Type=Integer) FORMAT/DP4 .. Deprecated in favor of FORMAT/ADF and FORMAT/ADR; Number of high-quality ref-forward, ref-reverse,
alt-forward and alt-reverse bases (Number=4,Type=Integer) FORMAT/DPR .. Deprecated in favor of FORMAT/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer) INFO/DPR .. Deprecated in favor of INFO/AD; Number of high-quality bases for each observed allele (Number=R,Type=Integer)
-g, --gvcf INT[,...]
--no-version
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
--threads INT
-U, --mwu-u
This option changes the INFO field names produced back to the ones used by the earlier Bcftools releases. For excample BQBZ becomes BQB.
-W[FMT], -W[=FMT], --write-index[=FMT]
Options for SNP/INDEL genotype likelihood computation¶
-X, --config STR
1.12 -Q13 -h100 -m1
bgi bgi-1.20 --indels-cns -B --indel-size 80 -F0.1 --indel-bias 0.9
--seqq-offset 120
illumina-1.18 [ default values ]
illumina illumina-1.20 --indels-cns --seqq-offset 125
ont -B -Q5 --max-BQ 30 -I
ont-sup ont-sup-1.20 --indels-cns -B -Q1 --max-BQ 35 --delta-BQ 99 -F0.2
-o15 -e1 -h110 --del-bias 0.4 --indel-bias 0.7
--poly-mqual --seqq-offset 130 --indel-size 80
pacbio-ccs-1.18 -D -Q5 --max-BQ 50 -F0.1 -o25 -e1 -M99999
pacbio-ccs pacbio-ccs-1.20 --indels-cns -B -Q5 --max-BQ 50 -F0.1 -o25 -e1 -h300
--delta-BQ 10 --del-bias 0.4 --poly-mqual
--indel-bias 0.9 --seqq-offset 118 --indel-size 80
--score-vs-ref 0.7
ultima ultima-1.20 --indels-cns -B -Q1 --max-BQ 30 --delta-BQ 10 -F0.15
-o20 -e10 -h250 --del-bias 0.3 --indel-bias 0.7
--poly-mqual --seqq-offset 140 --score-vs-ref 0.3
--indel-size 80
--ar, --ambig-reads drop|incAD|incAD0
-e, --ext-prob INT
-F, --gap-frac FLOAT
-h, --tandem-qual INT
--indel-bias FLOAT
--del-bias FLOAT
--indel-size INT
--seqq-offset INT
--poly-mqual
-I, --skip-indels
-L, --max-idepth INT
-m, --min-ireads INT
-M, --max-read-len INT
-o, --open-prob INT
-p, --per-sample-mF
-P, --platforms STR
Examples:¶
Call SNPs and short INDELs, then mark low quality sites and sites with the read depth exceeding a limit. (The read depth should be adjusted to about twice the average read depth as higher read depths usually indicate problematic regions which are often enriched for artefacts.) One may consider to add -C50 to mpileup if mapping quality is overestimated for reads containing excessive mismatches. Applying this option usually helps for BWA-backtrack alignments, but may not other aligners.
bcftools mpileup -Ou -f ref.fa aln.bam | \
bcftools call -Ou -mv | \
bcftools filter -s LowQual -e '%QUAL<20 || DP>100' > var.flt.vcf
bcftools norm [OPTIONS] file.vcf.gz¶
Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows. Left-alignment and normalization will only be applied if the --fasta-ref option is supplied.
-a, --atomize
--atom-overlaps .|*
# Before atomization:
100 CC C,GG 1/2
# After:
# bcftools norm -a --atom-overlaps .
100 C G ./1
100 CC C 1/.
101 C G ./1
# After:
# bcftools norm -a --atom-overlaps '*'
# bcftools norm -a --atom-overlaps \*
100 C G,* 2/1
100 CC C,* 1/2
101 C G,* 2/1
-c, --check-ref e|w|x|s
-d, --rm-dup snps|indels|both|all|exact
-D, --remove-duplicates
-f, --fasta-ref FILE
--force
-g, --gff-annot FILE
--keep-sum TAG[,...]
-m, --multiallelics -|+[snps|indels|both|any]
Note that multiallelic sites with both SNPs and indels will be split into biallelic sites with both -m -snps and -m -indels.
--multi-overlaps 0|.
--no-version
-N, --do-not-normalize
--old-rec-tag STR
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-s, --strict-filter
-t, --targets LIST
-T, --targets-file FILE
--targets-overlap 0|1|2
--threads INT
-w, --site-win INT
-W[FMT], -W[=FMT], --write-index[=FMT]
bcftools [plugin NAME|+NAME] [OPTIONS] FILE — [PLUGIN OPTIONS]¶
A common framework for various utilities. The plugins can be used the same way as normal commands only their name is prefixed with "+". Most plugins accept two types of parameters: general options shared by all plugins followed by a separator, and a list of plugin-specific options. There are some exceptions to this rule, some plugins do not accept the common options and implement their own parameters. Therefore please pay attention to the usage examples that each plugin comes with.
VCF input options:¶
-e, --exclude EXPRESSION
-i, --include EXPRESSION
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
--targets-overlap 0|1|2
VCF output options:¶
--no-version
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
--threads INT
-W[FMT], -W[=FMT], --write-index[=FMT]
Plugin options:¶
-h, --help
-l, --list-plugins
By default, appropriate system directories are searched for installed plugins. You can override this by setting the BCFTOOLS_PLUGINS environment variable to a colon-separated list of directories to search. If BCFTOOLS_PLUGINS begins with a colon, ends with a colon, or contains adjacent colons, the system directories are also searched at that position in the list of directories.
-v, --verbose
-V, --version
List of plugins coming with the distribution:¶
ad-bias
add-variantkey
af-dist
allele-length
check-ploidy
check-sparsity
color-chrs
contrast
counts
dosage
fill-from-fasta
fill-tags
fixploidy
fixref
frameshifts
GTisec
GTsubset
guess-ploidy
gvcfz
impute-info
indel-stats
isecGT
mendelian
mendelian2
missing2ref
parental-origin
prune
remove-overlaps
scatter
setGT
smpl-stats
split
split-vep
tag2tag
trio-dnm2
trio-stats
trio-switch-rate
variant-distance
variantkey-hex
Examples:¶
# List options common to all plugins bcftools plugin # List available plugins bcftools plugin -l # Run a plugin bcftools plugin counts in.vcf # Run a plugin using the abbreviated "+" notation bcftools +counts in.vcf # Run a plugin from an explicit location bcftools +/path/to/counts.so in.vcf # The input VCF can be streamed just like in other commands cat in.vcf | bcftools +counts # Print usage information of plugin "dosage" bcftools +dosage -h # Replace missing genotypes with 0/0 bcftools +missing2ref in.vcf # Replace missing genotypes with 0|0 bcftools +missing2ref in.vcf -- -p
Plugins troubleshooting:¶
Things to check if your plugin does not show up in the bcftools plugin -l output:
Plugins API:¶
// Short description used by 'bcftools plugin -l' const char *about(void); // Longer description used by 'bcftools +name -h' const char *usage(void); // Called once at startup, allows initialization of local variables. // Return 1 to suppress normal VCF/BCF header output, -1 on critical // errors, 0 otherwise. int init(int argc, char **argv, bcf_hdr_t *in_hdr, bcf_hdr_t *out_hdr); // Called for each VCF record, return NULL to suppress the output bcf1_t *process(bcf1_t *rec); // Called after all lines have been processed to clean up void destroy(void);
bcftools polysomy [OPTIONS] file.vcf.gz¶
Detect number of chromosomal copies in VCFs annotates with the Illumina’s B-allele frequency (BAF) values. Note that this command is not compiled in by default, see the section Optional Compilation with GSL in the INSTALL file for help.
General options:¶
-o, --output-dir path
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-s, --sample string
-t, --targets LIST
-T, --targets-file FILE
--targets-overlap 0|1|2
-v, --verbose
Algorithm options:¶
-b, --peak-size float
-c, --cn-penalty float
-f, --fit-th float
-i, --include-aa
-m, --min-fraction float
-p, --peak-symmetry float
bcftools query [OPTIONS] file.vcf.gz [file.vcf.gz [...]]¶
Extracts fields from VCF or BCF files and outputs them in user-defined format.
-e, --exclude EXPRESSION
--force-samples
-f, --format FORMAT
-F, --print-filtered STR
-H, --print-header
-i, --include EXPRESSION
-l, --list-samples
-N, --disable-automatic-newline
-o, --output FILE
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-s, --samples LIST
-S, --samples-file FILE
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
--targets-overlap 0|1|2
-u, --allow-undef-tags
-v, --vcf-list FILE
Format:¶
%CHROM The CHROM column (similarly also other columns: POS, ID, REF, ALT, QUAL, FILTER) %END End position of the REF allele %END0 End position of the REF allele in 0-based coordinates %FIRST_ALT Alias for %ALT{0} %FORMAT Prints all FORMAT fields or a subset of samples with -s or -S %GT Genotype (e.g. 0/1) %INFO Prints the whole INFO column %INFO/TAG Any tag in the INFO column %IUPACGT Genotype translated to IUPAC ambiguity codes (e.g. M instead of C/A) %LINE Prints the whole line %MASK Indicates presence of the site in other files (with multiple files) %N_PASS(expr) Number of samples that pass the filtering expression (see *<<expressions,EXPRESSIONS>>*) %POS0 POS in 0-based coordinates %PBINOM(TAG) Calculate phred-scaled binomial probability, the allele index is determined from GT %SAMPLE Sample name %TAG{INT} Curly brackets to print a subfield (e.g. INFO/TAG{1}, the indexes are 0-based) %TBCSQ Translated FORMAT/BCSQ. See the csq command above for explanation and examples. %TGT Translated genotype (e.g. C/A) %TYPE Variant type (REF, SNP, MNP, INDEL, BND, OTHER) [] Format fields must be enclosed in brackets to loop over all samples \n new line \t tab character
Everything else is printed verbatim.
Examples:¶
# Print chromosome, position, ref allele and the first alternate allele bcftools query -f '%CHROM %POS %REF %ALT{0}\n' file.vcf.gz
# Similar to above, but use tabs instead of spaces, add sample name and genotype bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' file.vcf.gz
# Print FORMAT/GT fields followed by FORMAT/GT fields bcftools query -f 'GQ:[ %GQ] \t GT:[ %GT]\n' file.vcf
# Make a BED file: chr, pos (0-based), end pos (1-based), id bcftools query -f'%CHROM\t%POS0\t%END\t%ID\n' file.bcf
# Print only samples with alternate (non-reference) genotypes bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]' -i'GT="alt"' file.bcf
# Print all samples at sites with at least one alternate genotype bcftools view -i'GT="alt"' file.bcf -Ou | bcftools query -f'[%CHROM:%POS %SAMPLE %GT\n]'
# Print phred-scaled binomial probability from FORMAT/AD tag for all heterozygous genotypes bcftools query -i'GT="het"' -f'[%CHROM:%POS %SAMPLE %GT %PBINOM(AD)\n]' file.vcf
# Print the second value of AC field if bigger than 10. Note the (unfortunate) difference in # index subscript notation: formatting expressions (-f) uses "{}" while filtering expressions # (-i) use "[]". This is for historic reasons and backward-compatibility. bcftools query -f '%AC{1}\n' -i 'AC[1]>10' file.vcf.gz
# Print all samples at sites where at least one sample has DP=1 or DP=2. In the second case # print only samples with DP=1 or DP=2, the difference is in the logical operator used, || vs |. bcftools query -f '[%SAMPLE %GT %DP\n]' -i 'FMT/DP=1 || FMT/DP=2' file.vcf bcftools query -f '[%SAMPLE %GT %DP\n]' -i 'FMT/DP=1 | FMT/DP=2' file.vcf
bcftools reheader [OPTIONS] file.vcf.gz¶
Modify header of VCF/BCF files, change sample names.
-f, --fai FILE
-h, --header FILE
-o, --output FILE
-s, --samples FILE
-T, --temp-prefix PATH
--threads INT
bcftools roh [OPTIONS] file.vcf.gz¶
A program for detecting runs of homo/autozygosity. Only bi-allelic sites are considered.
The HMM model:¶
Notation:
D = Data, AZ = autozygosity, HW = Hardy-Weinberg (non-autozygosity),
f = non-ref allele frequency Emission probabilities:
oAZ = P_i(D|AZ) = (1-f)*P(D|RR) + f*P(D|AA)
oHW = P_i(D|HW) = (1-f)^2 * P(D|RR) + f^2 * P(D|AA) + 2*f*(1-f)*P(D|RA) Transition probabilities:
tAZ = P(AZ|HW) .. from HW to AZ, the -a parameter
tHW = P(HW|AZ) .. from AZ to HW, the -H parameter
ci = P_i(C) .. probability of cross-over at site i, from genetic map
AZi = P_i(AZ) .. probability of site i being AZ/non-AZ, scaled so that AZi+HWi = 1
HWi = P_i(HW)
P_{i+1}(AZ) = oAZ * max[(1 - tAZ * ci) * AZ{i-1} , tAZ * ci * (1-AZ{i-1})]
P_{i+1}(HW) = oHW * max[(1 - tHW * ci) * (1-AZ{i-1}) , tHW * ci * AZ{i-1}]
General Options:¶
--AF-dflt FLOAT
--AF-tag TAG
--AF-file FILE
bcftools query -f'%CHROM\t%POS\t%REF,%ALT\t%INFO/TAG\n' file.vcf | bgzip -c > freqs.tab.gz
-b, --buffer-size INT[,INT]
-e, --estimate-AF FILE
--exclude EXPRESSION
-G, --GTs-only FLOAT
--include EXPRESSION
-I, --skip-indels
-m, --genetic-map FILE
-M, --rec-rate FLOAT
-o, --output FILE
-O, --output-type s|r[z]
# Output fields:
RG = predicted homo/autozygous regions
- Sample
- Chromosome
- Start
- End
- Length (bp)
- Number of markers
- Quality .. average phred score in the region from the forward-backward algorithm
ST = per-site output showing:
- Sample
- Chromosome
- Position
- State .. predicted state from the Viterbi algorithm, 0 for normal (HW, Hardy-Weinberg) or 1 for autozygous (AZ)
- Quality .. quality score from the forward-backward algorithm
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-s, --samples LIST
-S, --samples-file FILE
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
--targets-overlap 0|1|2
HMM Options:¶
-a, --hw-to-az FLOAT
-H, --az-to-hw FLOAT
-V, --viterbi-training FLOAT
bcftools sort [OPTIONS] file.bcf¶
-m, --max-mem FLOAT[kMG]
-o, --output FILE
-O, --output-type b|u|z|v[0-9]
-T, --temp-dir DIR
-W[FMT], -W[=FMT], --write-index[=FMT]
bcftools stats [OPTIONS] A.vcf.gz [B.vcf.gz]¶
Parses VCF or BCF and produces text file stats which is suitable for machine processing and can be plotted using plot-vcfstats. When two files are given, the program generates separate stats for intersection and the complements. By default only sites are compared, -s/-S must given to include also sample columns. When one VCF file is specified on the command line, then stats by non-reference allele frequency, depth distribution, stats by quality and per-sample counts, singleton stats, etc. are printed. When two VCF files are given, then stats such as concordance (Genotype concordance by non-reference allele frequency, Genotype concordance by sample, Non-Reference Discordance) and correlation are also printed. Per-site discordance (PSD) is also printed in --verbose mode.
--af-bins LIST|FILE
--af-tag TAG
-1, --1st-allele-only
-c, --collapse snps|indels|both|all|some|none
-d, --depth INT,INT,INT
--debug
-e, --exclude EXPRESSION
-E, --exons file.gz
tabix -s1 -b2 -e3 file.gz
-f, --apply-filters LIST
-F, --fasta-ref ref.fa
-i, --include EXPRESSION
-I, --split-by-ID
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-s, --samples LIST
-S, --samples-file FILE
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
--targets-overlap 0|1|2
-u, --user-tstv <TAG[:min:max:n]>
-v, --verbose
bcftools view [OPTIONS] file.vcf.gz [REGION [...]]¶
View, subset and filter VCF or BCF files by position and filtering expression. Convert between VCF and BCF. Former bcftools subset.
Output options¶
-G, --drop-genotypes
-h, --header-only
-H, --no-header
--with-header
-l, --compression-level [0-9]
--no-version
-O, --output-type b|u|z|v[0-9]
-o, --output FILE: output file name. If not present, the default is to print to standard output (stdout).
-r, --regions chr|chr:pos|chr:from-to|chr:from-[,...]
-R, --regions-file file
--regions-overlap 0|1|2
-t, --targets chr|chr:pos|chr:from-to|chr:from-[,...]
-T, --targets-file file
--targets-overlap 0|1|2
--threads INT
-W[FMT], -W[=FMT], --write-index[=FMT]
Subset options:¶
-A, --trim-unseen-alleles
-a, --trim-alt-alleles
--force-samples
-I, --no-update
-s, --samples LIST
-S, --samples-file FILE
Filter options:¶
Note that filter options below dealing with counting the number of alleles will, for speed, first check for the values of AC and AN in the INFO column to avoid parsing all the genotype (FORMAT/GT) fields in the VCF. This means that filters like --uncalled, --exclude-uncalled', or --min-af 0.1 will be calculated from INFO/AC and INFO/AN when available or FORMAT/GT otherwise. However, it will not attempt to use any other existing field, like INFO/AF for example. For that, use --exclude AF<0.1 instead.
Also note that one must be careful when sample subsetting and filtering is performed in a single command because the order of internal operations can influence the result. For example, the -i/-e filtering is performed before sample removal, but the -P filtering is performed after, and some are inherently ambiguous, for example allele counts can be taken from the INFO column when present but calculated on the fly when absent. Therefore it is strongly recommended to spell out the required order explicitly by separating such commands into two steps. (Make sure to use the -O u option when piping!)
-c, --min-ac INT[:nref|:alt1|:minor|:major|:'nonmajor']
-C, --max-ac INT[:nref|:alt1|:minor|:'major'|:'nonmajor']
-e, --exclude EXPRESSION
-f, --apply-filters LIST
-g, --genotype [^][hom|het|miss]
-i, --include EXPRESSION
-k, --known
-m, --min-alleles INT
-M, --max-alleles INT
-n, --novel
-p, --phased
-P, --exclude-phased
-q, --min-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
-Q, --max-af FLOAT[:nref|:alt1|:minor|:major|:nonmajor]
-u, --uncalled
-U, --exclude-uncalled
-v, --types snps|indels|mnps|other
-V, --exclude-types snps|indels|mnps|ref|bnd|other
-x, --private
-X, --exclude-private
bcftools help [COMMAND] | bcftools --help [COMMAND]¶
Display a brief usage message listing the bcftools commands available. If the name of a command is also given, e.g., bcftools help view, the detailed usage message for that particular command is displayed.
bcftools [--version|-v]¶
Display the version numbers and copyright information for bcftools and the important libraries used by bcftools.
bcftools [--version-only]¶
Display the full bcftools version number in a machine-readable format.
SCRIPTS¶
gff2gff¶
Attempts to fix a GFF file to be correctly parsed by csq.
zcat in.gff.gz | gff2gff | gzip -c > out.gff.gz
plot-vcfstats [OPTIONS] file.vchk [...]¶
Script for processing output of bcftools stats. It can merge results from multiple outputs (useful when running the stats for each chromosome separately), plots graphs and creates a PDF presentation.
-m, --merge
-p, --prefix DIR
-P, --no-PDF
-r, --rasterize
-s, --sample-names
-t, --title STRING
-v, --vectors
-T, --main-title STRING
Example:
# Generate the stats bcftools stats -s - > file.vchk
# Plot the stats plot-vcfstats -p outdir file.vchk
# The final looks can be customized by editing the generated # 'outdir/plot.py' script and re-running manually cd outdir && python plot.py && pdflatex summary.tex
FILTERING EXPRESSIONS¶
These filtering expressions are accepted by most of the commands.
Valid expressions may contain:
1, 1.0, 1e-4 "String" @file_name
+, *, -, /, %
== (same as =), >, >=, <=, <, !=
INFO/HAYSTACK ~ "needle" INFO/HAYSTACK ~ "NEEDless/i"
(, )
&&, &, ||, |
INFO/DP or DP FORMAT/DV, FMT/DV, or DV FILTER, QUAL, ID, CHROM, POS, REF, ALT[0]
FILTER="PASS" FILTER="." FILTER="A" .. exact match, for example "A;B" does not pass FILTER="A;B" .. exact match, "A;B" and "B;A" pass, everything else fails FILTER!="A" .. exact match, for example "A;B" does pass FILTER~"A" .. subset match, for example both "A" and "A;B" pass FILTER~"A;B" .. subset match, pass only if both "A" and "B" are present FILTER!~"A" .. complement match, for example both "A" and "A;B" fail FILTER!~"A;B" .. complement match, fail if both "A" and "B" are present
FlagA=1 && FlagB=0
DP=".", DP!=".", ALT="."
GT="mis", GT~"\.", GT!~"\."
GT=".|.", GT="./.", GT="."
GT="ref" GT="alt" GT="mis" GT="hom" GT="het" GT="hap" GT="RR" GT="AA" GT="RA" or GT="AR" GT="Aa" or GT="aA" GT="R" GT="A"
TYPE="snp" TYPE~"snp" TYPE!="snp" TYPE!~"snp"
INFO/AF[0] > 0.3 .. first AF value bigger than 0.3 FORMAT/AD[0:0] > 30 .. first AD value of the first sample bigger than 30 FORMAT/AD[0:1] .. first sample, second AD value FORMAT/AD[1:0] .. second sample, first AD value DP4[*] == 0 .. any DP4 value FORMAT/DP[0] > 30 .. DP of the first sample bigger than 30 FORMAT/DP[1-3] > 10 .. samples 2-4 FORMAT/DP[1-] < 7 .. all samples but the first FORMAT/DP[0,2-4] > 20 .. samples 1, 3-5 FORMAT/AD[0:1] .. first sample, second AD field FORMAT/AD[0:*], AD[0:] or AD[0] .. first sample, any AD field FORMAT/AD[*:1] or AD[:1] .. any sample, second AD field (DP4[0]+DP4[1])/(DP4[2]+DP4[3]) > 0.3 CSQ[*] ~ "missense_variant.*deleterious"
FORMAT/AD[GT] > 10 .. require support of more than 10 reads for each allele FORMAT/AD[0:GT] > 10 .. same as above, but in the first sample sSUM(FORMAT/AD[GT]) > 20 .. require total sample depth bigger than 20
GT[@samples.txt]="het" & binom(AD)<0.01
MAX, MIN, AVG, MEAN, MEDIAN, STDEV, SUM, STRLEN, ABS, COUNT
Note that functions above evaluate to a single value across all samples and are intended to select sites, not samples, even when applied on FORMAT tags. However, when prefixed with SMPL_ (or "s" for brevity, e.g. SMPL_MAX or sMAX), they will evaluate to a vector of per-sample values when applied on FORMAT tags:
SMPL_MAX, SMPL_MIN, SMPL_AVG, SMPL_MEAN, SMPL_MEDIAN, SMPL_STDEV, SMPL_SUM, sMAX, sMIN, sAVG, sMEAN, sMEDIAN, sSTDEV, sSUM
binom(FMT/AD) .. GT can be used to determine the correct index binom(AD[0],AD[1]) .. or the fields can be given explicitly phred(binom()) .. the same as binom but phred-scaled
N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN
N_PASS(GQ>90 & GT!="mis") > 90 F_PASS(GQ>90 & GT!="mis") > 0.9
perl:path/to/script.pl; perl.severity(INFO/CSQ) > 3
Notes:
-i 'TAG="hello,world"' -i 'TAG="hello" || TAG="world"' -i 'TAG="hello" && TAG="world"'
-i 'TAG[*]=1' .. true, the record will be printed -i 'TAG[*]!=1' .. true -e 'TAG[*]=1' .. false, the record will be discarded -e 'TAG[*]!=1' .. false -i 'TAG[0]=1' .. true -i 'TAG[0]!=1' .. false -e 'TAG[0]=1' .. false -e 'TAG[0]!=1' .. true
Examples:
MIN(DV)>5 .. selects the whole site, evaluates min across all values and samples
SMPL_MIN(DV)>5 .. selects matching samples, evaluates within samples
MIN(DV/DP)>0.3
MIN(DP)>10 & MIN(DV)>3
FMT/DP>10 & FMT/GQ>10 .. both conditions must be satisfied within one sample
FMT/DP>10 && FMT/GQ>10 .. the conditions can be satisfied in different samples
QUAL>10 | FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, but selects only samples with GQ>10
QUAL>10 || FMT/GQ>10 .. true for sites with QUAL>10 or a sample with GQ>10, plus selects all samples at such sites
TYPE="snp" && QUAL>=10 && (DP4[2]+DP4[3] > 2)
COUNT(GT="hom")=0 .. no homozygous genotypes at the site
AVG(GQ)>50 .. average (arithmetic mean) of genotype qualities bigger than 50
ID=@file .. selects lines with ID present in the file
ID!=@~/file .. skip lines with ID present in the ~/file
INFO/TAG=@file .. selects lines with INFO/TAG value present in the file
MAF[0]<0.05 .. select rare variants at 5% cutoff
POS>=100 .. restrict your range query, e.g. 20:100-200 to strictly sites with POS in that range.
Shell expansion:
Note that expressions must often be quoted because some characters have special meaning in the shell. An example of expression enclosed in single quotes which cause that the whole expression is passed to the program as intended:
bcftools view -i '%ID!="." & MAF[0]<0.01'
Please refer to the documentation of your shell for details.
TERMINOLOGY¶
The program and the documentation uses the following terminology, multiple terms can be used interchangeably for the same VCF record type
REF ALT --------- C . .. reference allele / non-variant site / ref-only site C T .. SNP or SNV (single-nucleotide polymorphism or variant), used interchangeably CC TT .. MNP (multi-nucleotide polymorphism) CAAA C .. indel, deletion (regardless of length) C CAAA .. indel, insertion (regardless of length) C <*> .. gVCF block, the allele <*> is a placeholder for alternate allele possibly missed because of low coverage C <NON_REF> .. synonymous to <*> C * .. overlapping deletion C <INS> .. symbolic allele, known also as 'other [than above]'
PERFORMANCE¶
HTSlib was designed with BCF format in mind. When parsing VCF files, all records are internally converted into BCF representation. Simple operations, like removing a single column from a VCF file, can be therefore done much faster with standard UNIX commands, such as awk or cut. Therefore it is recommended to use BCF as input/output format whenever possible to avoid large overhead of the VCF → BCF → VCF conversion.
BUGS¶
Please report any bugs you encounter on the github website: <http://github.com/samtools/bcftools>
AUTHORS¶
Heng Li from the Sanger Institute wrote the original C version of htslib, samtools and bcftools. Bob Handsaker from the Broad Institute implemented the BGZF library. Petr Danecek is maintaining and further developing bcftools, together with the rest of the samtools team <https://www.sanger.ac.uk/tool/samtools-bcftools-htslib>. Many other people contributed to the program and to the file format specifications, both directly and indirectly by providing patches, testing and reporting bugs. We thank them all.
RESOURCES¶
BCFtools GitHub website: <http://github.com/samtools/bcftools>
Samtools GitHub website: <http://github.com/samtools/samtools>
HTSlib GitHub website: <http://github.com/samtools/htslib>
File format specifications: <http://samtools.github.io/hts-specs>
BCFtools documentation: <http://samtools.github.io/bcftools>
BCFtools wiki page: <https://github.com/samtools/bcftools/wiki>
COPYING¶
The MIT/Expat License or GPL License, see the LICENSE document for details. Copyright (c) Genome Research Ltd.
2024-04-15 |