bbnorm.sh - Kmer-based error-correction and normalization tool
bbnorm.sh in=<input> out=<reads to keep> outt=<reads to toss> hist=<histogram output>
Normalizes read depth based on kmer counts. Can also error-correct, bin reads by kmer depth, and generate a kmer depth histogram. However, Tadpole has superior error-correction to BBNorm. Please read bbmap/docs/guides/BBNormGuide.txt for more information.
- Primary input. Use in2 for paired reads in a second file
- Second input file for paired reads in two files
- Additional files to use for input (generating hash table) but not for output
- Break up FASTA reads longer than this. Can be useful when processing scaffolded genomes
- Use at most this many reads when building the hashtable (-1 means all)
- Process every nth kmer, and skip the rest
- Process every nth read, and skip the rest
- May be set to true or false to override autodetection of whether the input file is paired and interleaved.
- ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
- File for normalized or corrected reads. Use out2 for paired reads in a second file
- (outtoss) File for reads that were excluded from primary output
- Only process this number of reads, then quit (-1 means all)
- Use sampling on output as well as input (not used if sample rates are 1)
- Set to true to keep all reads (e.g. if you just want error correction).
- Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin in histograms.
- Default is false, to prevent confusion about how there can be 0-count kmers. The reason is that based on the 'minq' and 'minprob' settings, some kmers may be excluded from the bloom filter.
- This will specify a directory for temp files (only needed for multipass runs). If null, they will be written to the output directory.
- Allows enabling/disabling of temporary directory; if disabled, temp files will be written to the output directory.
- ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
- Rename reads based on their kmer depth.
- Kmer length (values under 32 are most efficient, but arbitrarily high values are supported)
- Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^cbits. Automatically reduced to 16 in 2-pass.
- Large values decrease accuracy for a fixed amount of memory, so use the lowest number you can that will still capture highest-depth kmers.
- Number of times each kmer is hashed and stored. Higher is slower.
- Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory.
- True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable. The prefilter is more memory-efficient because it uses 2-bit cells.
- Number of hashes for prefilter.
- (pbits) Bits per cell in prefilter.
- Fraction of memory to allocate to prefilter.
- More passes can sometimes increase accuracy by iteratively removing low-depth kmers
- Ignore kmers containing bases with quality below this
- Ignore kmers with overall probability of correctness below this
- (t) Spawn exactly X hashing threads (default is number of logical processors). Total active threads may exceed X due to I/O threads.
- (removeduplicatekmers) When true, a kmer's count will only be incremented once per read pair, even if that kmer occurs more than once.
- (fs) Do a slower, high-precision bloom filter lookup of kmers that appear to have an abnormally high depth due to collisions.
- (tgt) Target normalization depth. NOTE: All depth parameters control kmer depth, not read depth.
- For kmer depth Dk, read depth Dr, read length R, and kmer size K: Dr=Dk*(R/(R-K+1))
- (max) Reads will not be downsampled when below this depth, even if they are above the target depth.
- (min) Kmers with depth below this number will not be included when calculating the depth of a read.
- (mgkpr) Reads must have at least this many kmers over min depth to be retained. Aka 'mingoodkmersperread'.
- (dp) Read depth is by default inferred from the 54th percentile of kmer depth, but this may be changed to any number 1-100.
- (uld) For pairs, use the depth of the lower read as the depth proxy.
- (dr) Generate random numbers deterministically to ensure identical output between multiple runs. May decrease speed with a huge number of threads.
- (p) 1 pass is the basic mode. 2 passes (default) allow greater accuracy, error detection, and better control of output depth.
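The relation between kmer depth and read depth given above, Dr=Dk*(R/(R-K+1)), can be checked with a short calculation. The Python sketch below is illustrative only and is not BBNorm code: the function name and the example values (150 bp reads, a kmer length of 31, a kmer depth of 100) are assumptions chosen for the example.

```python
def read_depth_from_kmer_depth(kmer_depth, read_length, k):
    """Convert kmer depth Dk to read depth Dr using Dr = Dk * (R / (R - K + 1))."""
    kmers_per_read = read_length - k + 1  # number of kmers in a read of length R
    if kmers_per_read <= 0:
        raise ValueError("read length must be at least k")
    return kmer_depth * read_length / kmers_per_read


# Hypothetical example: 150 bp reads, k=31, kmer depth 100.
print(read_depth_from_kmer_depth(100, 150, 31))  # 125.0
```

Under these assumed values, a kmer depth of 100 corresponds to a read depth of 125, which is why a depth parameter expressed in kmer depth yields somewhat higher read depth.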
Error detection parameters
- (highdepthpercentile) Position in sorted kmer depth array used as proxy of a read's high kmer depth.
- (lowdepthpercentile) Position in sorted kmer depth array used as proxy of a read's low kmer depth.
- (tbr) Throw away reads detected as containing errors.
- (rbb) Only toss bad pairs if both reads are bad.
- (edr) Reads with a ratio of at least this much between their high and low depth kmers will be classified as error reads.
- (ht) Threshold for high kmer. A high kmer at or above this value is considered non-error.
- (lt) Threshold for low kmer. Kmers at or below this value are always considered errors.
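The ratio test described above (a high-depth proxy and a low-depth proxy taken from the sorted kmer depth array, compared against a ratio) can be sketched as follows. This is a simplified illustration under stated assumptions, not BBNorm's implementation: the function names, the ascending sort order, the index rounding, and the example percentiles and ratio are all hypothetical, and the ht and lt thresholds are deliberately left out.

```python
def depth_at_percentile(sorted_depths, percentile):
    """Return the depth at the given percentile (0-100) of an ascending list."""
    if not sorted_depths:
        raise ValueError("read has no kmers")
    # Assumed rounding rule: truncate, then clamp to the last valid index.
    index = min(len(sorted_depths) - 1, int(len(sorted_depths) * percentile / 100))
    return sorted_depths[index]


def looks_like_error_read(kmer_depths, high_pct, low_pct, ratio):
    """Flag a read whose high-depth proxy is at least `ratio` times its low-depth proxy."""
    ordered = sorted(kmer_depths)
    high = depth_at_percentile(ordered, high_pct)
    low = depth_at_percentile(ordered, low_pct)
    # Multiply instead of dividing so a low depth of 0 cannot raise ZeroDivisionError.
    # The extra high > low check keeps uniform-depth reads from being flagged.
    return high >= ratio * low and high > low


# Hypothetical read: mostly high-depth kmers, with three depth-1 kmers
# such as a single sequencing error might produce.
depths = [40, 42, 41, 1, 1, 1, 43, 44, 45, 46]
print(looks_like_error_read(depths, high_pct=90, low_pct=20, ratio=10))  # True
```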
Error correction parameters
- Set to true to correct errors. NOTE: Tadpole is now preferred for ecc as it does a better job.
- Correct up to this many errors per read. If more are detected, the read will remain unchanged.
- (ecr) Adjacent kmers with a depth ratio of at least this much will be classified as an error.
- (echt) Threshold for high kmer. A kmer at this or above may be considered non-error.
- (eclt) Threshold for low kmer. Kmers at this and below are considered errors.
- Do not correct bases with quality above this value.
- (aggressiveErrorCorrection) Sets more aggressive values of ecr=100, ecclimit=7, echt=16, eclt=3.
- (conservativeErrorCorrection) Sets more conservative values of ecr=180, ecclimit=2, echt=30, eclt=1, sl=4, pl=4.
- (markErrorsOnly) Marks errors by reducing quality value of suspected errors; does not correct anything.
- (markUncorrectableErrors) Marks errors only on uncorrectable reads; requires 'ecc=t'.
- (ecco) Error correct by read overlap.
Depth binning parameters
- (lbd) Cutoff for low depth bin.
- (hbd) Cutoff for high depth bin.
- Pairs in which both reads have a median below lbd go into this file.
- Pairs in which both reads have a median above hbd go into this file.
- All other pairs go into this file.
- Specify a file to write the input kmer depth histogram.
- Specify a file to write the output kmer depth histogram.
- (histogramcolumns) Number of histogram columns, 2 or 3.
- (printzerocoverage) Print lines in the histogram with zero coverage.
- Max kmer depth displayed in histogram. Also affects statistics displayed, but does not affect normalization.
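The depth binning rule described above (both reads below lbd go to the low bin, both above hbd go to the high bin, everything else goes to the middle bin) can be expressed in a few lines. The Python sketch below only restates that rule; the function name and the cutoff values are assumptions for illustration.

```python
def depth_bin(median1, median2, low_cutoff, high_cutoff):
    """Return 'low', 'high', or 'mid' for a pair, given each read's median kmer depth."""
    if median1 < low_cutoff and median2 < low_cutoff:
        return "low"   # both reads below the low cutoff (lbd)
    if median1 > high_cutoff and median2 > high_cutoff:
        return "high"  # both reads above the high cutoff (hbd)
    return "mid"       # all other pairs, including pairs whose reads disagree


# Hypothetical cutoffs, chosen only for illustration.
for pair in [(3, 5), (120, 95), (3, 120), (40, 50)]:
    print(pair, depth_bin(*pair, low_cutoff=10, high_cutoff=80))
```

Note that a pair with one low-depth read and one high-depth read lands in the middle bin, because both reads must satisfy a cutoff for the pair to be binned as low or high.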
Peak calling parameters
- Write the peaks to this file. Default is stdout.
- (h) Ignore peaks shorter than this.
- (v) Ignore peaks with less area than this.
- (w) Ignore peaks narrower than this.
- (minp) Ignore peaks with an X-value below this.
- (maxp) Ignore peaks with an X-value above this.
- (maxpc) Print up to this many peaks (prioritizing height).
- This will set Java's memory usage, overriding autodetection.
- -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
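The peak-calling filters listed under "Peak calling parameters" (minimum height, area, and width, an allowed X-value range, and a maximum number of peaks prioritized by height) can be read as a filter followed by a ranking. The Python sketch below illustrates that reading only; the Peak type, its field names, the function name, and all values are hypothetical and do not reflect BBNorm's internals.

```python
from typing import NamedTuple


class Peak(NamedTuple):
    center: int  # X-value (kmer depth) of the peak
    height: int
    width: int
    volume: int  # area under the peak


def select_peaks(peaks, min_height, min_volume, min_width, min_x, max_x, max_count):
    """Drop peaks that fail any filter, then keep the tallest `max_count` peaks."""
    kept = [
        p for p in peaks
        if p.height >= min_height
        and p.volume >= min_volume
        and p.width >= min_width
        and min_x <= p.center <= max_x
    ]
    kept.sort(key=lambda p: p.height, reverse=True)  # prioritize by height
    return kept[:max_count]


# Hypothetical peaks from a kmer depth histogram.
peaks = [
    Peak(center=2, height=500, width=3, volume=900),
    Peak(center=30, height=2000, width=12, volume=20000),
    Peak(center=60, height=800, width=10, volume=7000),
    Peak(center=95, height=50, width=8, volume=300),
]
for p in select_peaks(peaks, min_height=100, min_volume=1000, min_width=5,
                      min_x=5, max_x=200, max_count=2):
    print(p.center, p.height)
```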
Written by Brian Bushnell (Last modified October 19, 2017)
Please contact Brian Bushnell at email@example.com if you encounter any problems, or post at: http://seqanswers.com/forums/showthread.php?t=41057
This manpage was written by Andreas Tille for the Debian distribution and can be used for any other usage of the program.
bbnorm.sh 38.43, April 2019