'\" t .\" Title: mash-sketch .\" Author: [see the "AUTHOR(S)" section] .\" Generator: Asciidoctor 2.0.12 .\" Date: .\" Manual: \ \& .\" Source: \ \& .\" Language: English .\" .TH "MASH\-SKETCH" "1" "" "\ \&" "\ \&" .ie \n(.g .ds Aq \(aq .el .ds Aq ' .ss \n[.ss] 0 .nh .ad l .de URL \fI\\$2\fP <\\$1>\\$3 .. .als MTO URL .if \n[.g] \{\ . mso www.tmac . am URL . ad l . . . am MTO . ad l . . . LINKSTYLE blue R < > .\} .SH "NAME" mash\-sketch \- create sketches (reduced representations for fast operations) .SH "SYNOPSIS" .sp \fBmash sketch\fP [options] fast(a|q)[.gz] ... .SH "DESCRIPTION" .sp Create a sketch file, which is a reduced representation of a sequence or set of sequences (based on min\-hashes) that can be used for fast distance estimations. Input can be fasta or fastq files (gzipped or not), and "\-" can be given to read from standard input. Input files can also be files of file names (see \fB\-l\fP). For output, one sketch file will be generated, but it can have multiple sketches within it, divided by sequences or files (see \fB\-i\fP). By default, the output file name will be the first input file with a \(aq.msh\(aq extension, or \(aqstdin.msh\(aq if standard input is used (see \fB\-o\fP). .SH "OPTIONS" .sp \fB\-h\fP .RS 4 Help .RE .sp \fB\-p\fP .RS 4 Parallelism. This many threads will be spawned for processing. [1] .RE .SS "Input" .sp \fB\-l\fP .RS 4 List input. Each file contains a list of sequence files, one per line. .RE .SS "Output" .sp \fB\-o\fP .RS 4 Output prefix (first input file used if unspecified). The suffix \(aq.msh\(aq will be appended. .RE .SS "Sketching" .sp \fB\-k\fP .RS 4 K\-mer size. Hashes will be based on strings of this many nucleotides. Canonical nucleotides are used by default (see Alphabet options below). (1\-32) [21] .RE .sp \fB\-s\fP .RS 4 Sketch size. Each sketch will have at most this many non\-redundant min\-hashes. [1000] .RE .sp \fB\-i\fP .RS 4 Sketch individual sequences, rather than whole files. .RE .sp \fB\-w\fP .RS 4 Probability threshold for warning about low k\-mer size. (0\-1) [0.01] .RE .sp \fB\-r\fP .RS 4 Input is a read set. See Reads options below. Incompatible with \fB\-i\fP. .RE .SS "Sketching (reads)" .sp \fB\-b\fP .RS 4 Use a Bloom filter of this size (raw bytes or with K/M/G/T) to filter out unique k\-mers. This is useful if exact filtering with \fB\-m\fP uses too much memory. However, some unique k\-mers may pass erroneously, and copies cannot be counted beyond 2. Implies \fB\-r\fP. .RE .sp \fB\-m\fP .RS 4 Minimum copies of each k\-mer required to pass noise filter for reads. Implies \fB\-r\fP. [1] .RE .sp \fB\-c\fP .RS 4 Target coverage. Sketching will conclude if this coverage is reached before the end of the input file (estimated by average k\-mer multiplicity). Implies \fB\-r\fP. .RE .sp \fB\-g\fP .RS 4 Genome size. If specified, will be used for p\-value calculation instead of an estimated size from k\-mer content. Implies \fB\-r\fP. .RE .SS "Sketching (alphabet)" .sp \fB\-n\fP .RS 4 Preserve strand (by default, strand is ignored by using canonical DNA k\-mers, which are alphabetical minima of forward\-reverse pairs). Implied if an alphabet is specified with \fB\-a\fP or \fB\-z\fP. .RE .sp \fB\-a\fP .RS 4 Use amino acid alphabet (A\-Z, except BJOUXZ). Implies \fB\-n\fP, \fB\-k\fP 9. .RE .sp \fB\-z\fP .RS 4 Alphabet to base hashes on (case ignored by default; see \fB\-Z\fP). K\-mers with other characters will be ignored. Implies \fB\-n\fP. .RE .sp \fB\-Z\fP .RS 4 Preserve case in k\-mers and alphabet (case is ignored by default). Sequence letters whose case is not in the current alphabet will be skipped when sketching. .RE .SH "SEE ALSO" .sp mash(1)