reformat.sh - Reformats reads between fasta/fastq/scarf/fasta+qual/sam, interleaved/paired, and ASCII-33/64
reformat.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2>
Reformats reads to change ASCII quality encoding, interleaving, file format, or compression format. Optionally performs additional functions such as quality trimming, subsetting, and subsampling. Supports fastq, fasta, fasta+qual, scarf, oneline, sam, bam, gzip, bz2. Please read bbmap/docs/guides/ReformatGuide.txt for more information.
in2 and out2 are for paired reads and are optional. If input is paired and there is only one output file, it will be written interleaved.
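The interleaved layout mentioned above can be pictured with a minimal Python sketch. It is not reformat.sh's implementation; the function name `interleave` and the read names are hypothetical, and the only assumption is that an interleaved file alternates read 1 and read 2 of each pair.

```python
from itertools import chain


def interleave(reads1, reads2):
    """Alternate read 1 and read 2 of each pair: r1a, r2a, r1b, r2b, ..."""
    if len(reads1) != len(reads2):
        raise ValueError("paired inputs must contain the same number of reads")
    return list(chain.from_iterable(zip(reads1, reads2)))


# Hypothetical read names, for illustration only.
r1 = ["pairA/1", "pairB/1"]
r2 = ["pairA/2", "pairB/2"]
interleaved = interleave(r1, r2)
print(interleaved)
```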
Parameters and their defaults:
- (overwrite) Overwrites files that already exist.
- (append) Append to files that already exist.
- (ziplevel) Set compression level, 1 (low) to 9 (max).
- (interleaved) Determines whether INPUT file is considered interleaved.
- Length of lines in fasta output.
- Set to a non-zero number to break fasta files into reads of at most this length.
- Ignore fasta reads shorter than this.
- ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.
- ASCII offset for output quality. May be 33 (Sanger), 64 (Illumina), or auto (same as input).
- Quality value used for fasta to fastq reformatting.
- qfin=<.qual file>
- Read qualities from this qual file, for the reads coming from 'in=<fasta file>'
- qfin2=<.qual file>
- Read qualities from this qual file, for the reads coming from 'in2=<fasta file>'
- qfout=<.qual file>
- Write qualities to this qual file, for the reads going to 'out=<fasta file>'
- qfout2=<.qual file>
- Write qualities to this qual file, for the reads going to 'out2=<fasta file>'
- (outs) If a read is longer than minlength and its mate is shorter, the longer one goes here.
- Delete input upon successful completion.
- Optional reference fasta for sam processing.
- (vpair) When true, checks reads to see if the names look paired. Prints an error message if not.
- (vint) sets 'vpair' to true and 'interleaved' to true.
- (ain) When verifying pair names, allows identical names, instead of requiring /1 and /2 or 1: and 2:
- (tbr) Discard reads that have different numbers of bases and qualities. By default this will be detected and cause a crash.
- (ibq) Fix out-of-range quality values instead of crashing with a warning.
- Append ' /1' and ' /2' to read names, if not already present. Please include the flag 'int=t' if the reads are interleaved.
- Put a space before the slash in addslash mode.
- Append ' 1:' and ' 2:' to read names, if not already present. Please include the flag 'int=t' if the reads are interleaved.
- Change whitespace in read names to underscores.
- (rc) Reverse-complement reads.
- (rcm) Reverse-complement read 2 only.
- (cq) N bases always get a quality of 0 and ACGT bases get a min quality of 2.
- Quantize qualities to a subset of values like NextSeq. Can also be used with comma-delimited list, like quantize=0,8,13,22,27,32,37
- (touppercase) Change lowercase letters in reads to uppercase.
- Make duplicate names unique by appending _<number>.
- A set of pairs: remap=CTGN will transform C>T and G>N.
- Use remap1 and remap2 to specify read 1 or 2.
- (itn) Convert non-ACGTN symbols to N.
- Kill this process if it crashes. monitor=600,0.01 would kill after 600 seconds under 1% usage.
- Crash when encountering reads with invalid bases.
- Discard reads with invalid characters as bases.
- Convert invalid bases to N.
- Convert nonstandard header characters to standard ASCII.
- (recal) Recalibrate quality scores. Must first generate matrices with CalcTrueQuality.
- Quality scores capped at this upper bound.
- Quality scores of ACGT bases will be capped at this lower bound.
- (trd) Trim the names of reads after the first whitespace.
- For sam/bam files, trim rname/rnext fields after the first space.
- Replace characters in headers such as space, *, and | to make them valid file names.
- For fasta, issue a warning if a sequenceless header is encountered.
- Issue a warning for only the first sequenceless header.
- Convert U to T (for RNA -> DNA translation).
- Pad the left end of sequences with this many symbols.
- Pad the right end of sequences with this many symbols.
- Set padleft and padright to the same value.
- Symbol to use for padding.
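Three of the transformations listed above (changing the ASCII quality offset, reverse-complementing, and remapping bases with a string such as CTGN) are simple enough to sketch in Python. This is an illustration of what the options describe, not the tool's own code; the function and argument names (`convert_quality`, `in_offset`, `out_offset`, `reverse_complement`, `remap_bases`) are hypothetical.

```python
def convert_quality(qual, in_offset=64, out_offset=33):
    """Re-encode a quality string from one ASCII offset to another."""
    return "".join(chr(ord(c) - in_offset + out_offset) for c in qual)


# Translation table for complementing bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(bases):
    """Complement every base, then reverse the sequence."""
    return bases.translate(_COMPLEMENT)[::-1]


def remap_bases(bases, pairs):
    """Apply a remap string made of (from, to) pairs, e.g. 'CTGN' is C>T and G>N."""
    if len(pairs) % 2 != 0:
        raise ValueError("remap string must contain an even number of symbols")
    table = str.maketrans(pairs[0::2], pairs[1::2])
    return bases.translate(table)


print(convert_quality("hhh@"))       # ASCII-64 to ASCII-33
print(reverse_complement("AACGN"))
print(remap_bases("ACGT", "CTGN"))
```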
Histogram output parameters
- Base composition histogram by position.
- Quality histogram by position.
- Count of bases with each quality value.
- Histogram of average read quality.
- Quality histogram designed for box plots.
- Read length histogram.
- Read GC content histogram.
- Number gchist bins. Set to 'auto' to use read length.
- Add a graphical representation to the gchist.
- Set an upper bound for histogram lengths; higher uses more memory.
- The default is 6000 for some histograms and 80000 for others.
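As a rough picture of what the read length and GC content histograms contain, here is a small Python sketch. It assumes that GC content is computed over called bases (A, C, G, T) only and that bins are equal-width; the tool's actual binning and handling of N may differ. All names are hypothetical.

```python
from collections import Counter


def gc_fraction(bases):
    """Fraction of called bases (ACGT) that are G or C; 0.0 if none are called."""
    called = [b for b in bases.upper() if b in "ACGT"]
    if not called:
        return 0.0
    return sum(b in "GC" for b in called) / len(called)


def length_histogram(reads):
    """Map read length -> number of reads with that length."""
    return Counter(len(r) for r in reads)


def gc_histogram(reads, bins=4):
    """Count reads per equal-width GC bin; a GC fraction of 1.0 goes in the last bin."""
    hist = [0] * bins
    for r in reads:
        index = min(int(gc_fraction(r) * bins), bins - 1)
        hist[index] += 1
    return hist


reads = ["ACGT", "GGCC", "AATT", "ACG"]
print(length_histogram(reads))
print(gc_histogram(reads))
```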
Histograms for sam files only (requires sam format 1.4 or higher):
- Errors-per-read histogram.
- Quality accuracy histogram of error rates versus quality score.
- Indel length histogram.
- Histogram of match, sub, del, and ins rates by read location.
- Insert size histograms. Requires paired reads in a sam file.
- Histogram of read count versus percent identity.
- Number idhist bins. Set to 'auto' to use read length.
Sampling parameters
- Set to a positive number to only process this many INPUT reads (or pairs), then quit.
- Skip (discard) this many INPUT reads before processing the rest.
- Randomly output only this fraction of reads; 1 means sampling is disabled.
- Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
- (srt) Exact number of OUTPUT reads (or pairs) desired.
- (sbt) Exact number of OUTPUT bases desired.
- Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
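The difference between sampling by fraction and sampling an exact number of reads can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's algorithm: fraction sampling is modeled as an independent keep/drop decision per read, exact sampling as drawing a fixed number of positions, and a fixed seed makes either repeatable. The function names are hypothetical.

```python
import random


def sample_by_rate(reads, rate, seed=None):
    """Keep each read independently with probability `rate`; rate=1 keeps everything."""
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < rate]


def sample_exact(reads, target, seed=None):
    """Keep exactly `target` reads (or all of them if fewer exist), in input order."""
    rng = random.Random(seed)
    keep = sorted(rng.sample(range(len(reads)), min(target, len(reads))))
    return [reads[i] for i in keep]


reads = [f"read{i}" for i in range(100)]
print(len(sample_by_rate(reads, 0.25, seed=7)))  # roughly a quarter, not exact
print(len(sample_exact(reads, 10, seed=7)))      # always exactly 10
```

In this sketch the exact variant needs the total number of reads before it can choose, which may be one reason such options sit poorly with streamed input or with filters that change the read count afterwards.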
Trimming and filtering parameters
- Trim read ends to remove bases with quality below trimq.
- Values: t (trim both ends), f (neither end), r (right end only), l (left end only), w (sliding window).
- Regions with average quality BELOW this will be trimmed. Can be a floating-point number like 7.3.
- (ml) Reads shorter than this after trimming will be discarded. Pairs will be discarded only if both are shorter.
- (mlf) Reads shorter than this fraction of original length after trimming will be discarded.
- If nonzero, reads longer than this after trimming will be discarded.
- If nonzero, reads longer than this will be broken into multiple reads of this length. Does not work for paired reads.
- (rbb) Only discard pairs if both reads are shorter than minlen.
- (invert) Output failing reads instead of passing reads.
- (maq) Reads with average quality (after trimming) below this will be discarded.
- If positive, calculate maq from this many initial bases.
- (cf) Reads with names containing ' 1:Y:' or ' 2:Y:' will be discarded.
- Remove reads with unexpected barcodes if barcodes is set, or barcodes containing 'N' otherwise.
- A barcode must be the last part of the read header.
- Comma-delimited list of barcodes or files of barcodes.
- If 0 or greater, reads with more Ns than this (after trimming) will be discarded.
- (mcb) Discard reads without at least this many consecutive called bases.
- (ftl) If nonzero, trim left bases of the read to this position (exclusive, 0-based).
- (ftr) If nonzero, trim right bases of the read after this position (exclusive, 0-based).
- (ftr2) If positive, trim this many bases on the right end.
- (ftm) If positive, trim length to be equal to zero modulo this number.
- Discard reads with GC content below this.
- Discard reads with GC content above this.
- Use average GC of paired reads.
- Also affects gchist.
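The position-based trimming options (ftl, ftr, ftm) and the pair rule for minimum length are easy to misread, so here is a small Python sketch of one reading of them. It assumes ftl removes everything before the given 0-based position, ftr removes everything after the given 0-based position, and ftm shortens the read from the right until its length is a multiple of the given number. It is not the tool's code, and the function names are hypothetical.

```python
def force_trim(bases, ftl=0, ftr=0):
    """Keep positions ftl..ftr (0-based, inclusive); zero disables either side."""
    start = ftl if ftl > 0 else 0
    end = ftr + 1 if ftr > 0 else len(bases)
    return bases[start:end]


def force_trim_modulo(bases, ftm=0):
    """Drop trailing bases so that the length is zero modulo ftm."""
    if ftm <= 0:
        return bases
    return bases[: len(bases) - len(bases) % ftm]


def pair_passes_min_length(read1, read2, min_length):
    """A pair is discarded only if both reads are shorter than min_length."""
    return len(read1) >= min_length or len(read2) >= min_length


read = "ACGTACGTACG"  # 11 bases
print(force_trim(read, ftl=2, ftr=8))
print(force_trim_modulo(read, ftm=5))
```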
Sam and bam processing options:
- Toss unmapped reads.
- Toss mapped reads.
- Toss reads that are not mapped as proper pairs.
- Toss reads that are mapped as proper pairs.
- Toss secondary alignments. Set this to true for sam to fastq conversion.
- If non-negative, toss reads with mapq under this.
- If non-negative, toss reads with mapq over this.
- (rbits) Toss sam lines with any of these flag bits unset. Similar to samtools -f.
- (fbits) Toss sam lines with any of these flag bits set. Similar to samtools -F.
- Set to true to write a tag indicating read stop location, prefixed by YS:i:
- Set to 'sam=1.3' to convert '=' and 'X' cigar symbols (from sam 1.4+ format) to 'M'.
- Set to 'sam=1.4' to convert 'M' to '=' and 'X' (sam=1.4 requires MD tags to be present, or ref to be specified).
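The flag-bit and mapq filters above follow the same logic as the samtools -f and -F options they are compared to. A minimal Python sketch, with hypothetical argument names and the standard SAM flag values (0x2 proper pair, 0x4 unmapped, 0x100 secondary alignment):

```python
def keep_sam_line(flag, mapq, required_bits=0, filter_bits=0,
                  min_mapq=-1, max_mapq=-1):
    """Return True if a sam line survives the flag-bit and mapq filters."""
    # Toss the line if any required bit is unset (like samtools -f).
    if (flag & required_bits) != required_bits:
        return False
    # Toss the line if any filtered bit is set (like samtools -F).
    if (flag & filter_bits) != 0:
        return False
    # Negative mapq limits mean the corresponding filter is disabled.
    if min_mapq >= 0 and mapq < min_mapq:
        return False
    if max_mapq >= 0 and mapq > max_mapq:
        return False
    return True


PROPER_PAIR, UNMAPPED, SECONDARY = 0x2, 0x4, 0x100

# Flag 99: paired, proper pair, mate reversed, first in pair.
print(keep_sam_line(99, 30, required_bits=PROPER_PAIR,
                    filter_bits=UNMAPPED | SECONDARY))
```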
Sam and bam alignment filtering options:
These require = and X symbols in cigar strings, or MD tags, or a reference fasta. -1 means disabled; to filter reads with any of a symbol type, set to 0.
- Discard reads with more than this many substitutions.
- Discard reads with more than this many insertions.
- Discard reads with more than this many deletions.
- Discard reads with more than this many indels.
- Discard reads with more than this many edits.
- Discard reads with an insertion longer than this.
- Discard reads with a deletion longer than this.
- Discard reads with identity below this.
- Discard reads with more than this many soft-clipped bases.
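To show why these filters need = and X symbols, the sketch below counts edits directly from an extended cigar string and applies the "-1 means disabled, 0 means none allowed" convention. It is an illustration only: it assumes substitutions are counted per base and insertions and deletions per event, which may not match the tool, and it does not handle the MD-tag or reference-based paths. All names are hypothetical.

```python
import re

_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def edit_stats(cigar):
    """Summarize edits in a cigar string that uses = and X instead of M."""
    stats = {"subs": 0, "ins": 0, "dels": 0,
             "max_ins": 0, "max_del": 0, "soft_clipped": 0}
    for length, op in _CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "M":
            # M hides whether a base matched, so substitutions cannot be counted.
            raise ValueError("cigar uses M; = and X symbols are required")
        if op == "X":
            stats["subs"] += n
        elif op == "I":
            stats["ins"] += 1
            stats["max_ins"] = max(stats["max_ins"], n)
        elif op == "D":
            stats["dels"] += 1
            stats["max_del"] = max(stats["max_del"], n)
        elif op == "S":
            stats["soft_clipped"] += n
    return stats


def within(value, limit):
    """A negative limit disables the filter; otherwise value must not exceed it."""
    return limit < 0 or value <= limit


def keep_alignment(cigar, max_subs=-1, max_ins=-1, max_dels=-1,
                   max_ins_len=-1, max_del_len=-1, max_clip=-1):
    s = edit_stats(cigar)
    return (within(s["subs"], max_subs) and within(s["ins"], max_ins)
            and within(s["dels"], max_dels)
            and within(s["max_ins"], max_ins_len)
            and within(s["max_del"], max_del_len)
            and within(s["soft_clipped"], max_clip))


print(edit_stats("10=1X5=2I20=3D4=2S"))
```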
Kmer counting and cardinality estimation:
The # symbol will be substituted for 1 and 2. The % symbol in out will be substituted for input name minus extensions. For example:
- reformat.sh in=read#.fq out=%.fa
...is equivalent to:
- reformat.sh in1=read1.fq in2=read2.fq out1=read1.fa out2=read2.fa
- This will set Java's memory usage, overriding autodetection.
- -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
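The # and % substitution described earlier in this section can be reproduced with a few lines of Python. This sketch only mirrors the documented example (in=read#.fq out=%.fa); it assumes that "minus extensions" means everything from the first dot of the file name onward is removed, and the function names are hypothetical.

```python
import os


def expand_hash(pattern):
    """Replace # with 1 and 2; a pattern without # names a single file."""
    if "#" not in pattern:
        return (pattern,)
    return (pattern.replace("#", "1"), pattern.replace("#", "2"))


def strip_extensions(path):
    """Drop everything from the first dot of the file name, keeping any directory."""
    head, tail = os.path.split(path)
    return os.path.join(head, tail.split(".", 1)[0])


def expand_names(in_pattern, out_pattern):
    """Return (inputs, outputs) after # and % substitution."""
    inputs = expand_hash(in_pattern)
    outputs = tuple(out_pattern.replace("%", strip_extensions(name))
                    for name in inputs)
    return inputs, outputs


inputs, outputs = expand_names("read#.fq", "%.fa")
print(inputs)
print(outputs)
```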
Written by Brian Bushnell (Last modified February 21, 2019)
Please contact Brian Bushnell at email@example.com if you encounter any problems.
This manpage was written by Andreas Tille for the Debian distribution and can be used for any other usage of the program.
reformat.sh 38.43, April 2019