.TH MERYL "1" "May 2015" "meryl 0~20150520+2004" "User Commands"
.SH NAME
meryl \- in- and out-of-core kmer counting and utilities
.SH SYNOPSIS
.SS Estimating memory requirements
.PP
.B meryl \-P
.BI \-m \0kmersize
.RB [ \-c
.IR # ]
.RB [ \-p ]
.BI \-s \0seq.fasta
.PP
.B meryl \-P
.BI \-m \0kmersize
.RB [ \-c
.IR # ]
.RB [ \-p ]
.BI \-n \0mercount
.SS Building a table
.B meryl \-B
.BI \-m \0kmersize
.RB [ \-c
.IR # ]
.RB [ \-p ]
.RB [ \-v ]
.RB [ \-f | \-r | \-C ]
.RB [ \-L
.IR minoccurrence ]
.RB [ \-U
.IR maxoccurrence ]
.RB [ \-threads
.IR n \0|
.RB { \-segments
.IR segments \0|
.BI \-memory \0megabytes \fR}
.RB [ \-configbatch \0[ \-sge
.IR jobname ]]]
.BI \-s \0seq.fasta
.BI \-o \0tblprefix
.PP
.B meryl
.BI \-countbatch \0number
.RB [ \-sgebuild
\fR"\fIqsuboptionstring\fR"]
.BI \-o \0tblprefix
.PP
.B meryl
.BI \-mergebatch \0number
.RB [ \-sgemerge
\fR"\fIqsuboptionstring\fR"]
.BI \-o \0tblprefix
.SS Performing operations on a table
.B meryl
.BI \-M \0operation
.RB [ \-v ]
.BI \-s \0tblprefix
.RB [ \-s
.IR \0tblprefix2 \0...]
.BI \-o \0output
.SS Dumping a table
.B meryl \-Dh
.BI \-s \0tblprefix
.PP
.B meryl \-Dt
.BI \-n \0mincount
.BI \-s \0tblprefix
.SH DESCRIPTION
\fBmeryl\fR computes the kmer content of genomic sequences.
Kmer content is represented as a list of kmers and the number of times each occurs in the input sequences.
Counting can be restricted to forward kmers only, reverse kmers only, or canonical kmers (the lexicographically smaller of the forward and reverse kmer at each location).
\fBMeryl\fR can report the histogram of counts, the list of kmers and their counts, or can perform mathematical and set operations on the processed data files.
.PP
The output of
.B meryl
is a pair of binary files, together called a meryl database, which can be
quickly dumped to provide either a histogram of counts or the actual counts.
A C++ library is supplied for direct access to the files.
.SH OPTIONS
.TP
.B \-P
Estimate memory requirements. Given a sequence file (\fB\-s\fR) or an upper limit on the number of mers in the file (\fB\-n\fR), compute the table size
(\fB\-t\fR in build) to minimize the memory usage. This mode recognizes the following options:
.RS
.TP
.BI \-m \0#
size of a mer (required)
.TP
.BI \-c \0#
homopolymer compression (optional)
.TP
.B \-p
enable positions
.TP
.BI \-s \0seq.fasta
Sequence file to be scanned to determine the number of mers
.TP
.BI \-n \0#
compute parameters assuming a file with this many mers in it
.PP
Only one of \fB\-s\fR and \fB\-n\fR needs to be specified.
If both are given, \fB\-s\fR takes priority.
.RE
.TP
.B \-B
Compute the mer-count tables for a sequence file (\fB\-s\fR), controlled by the options below.
By default, both strands are processed.
.RS
.TP
.B \-f
only build for the forward strand
.TP
.B \-r
only build for the reverse strand
.TP
.B \-C
use canonical mers (assumes both strands)
.TP
.BI \-L \0#
do not save mers that occur fewer than # times
.TP
.BI \-U \0#
do not save mers that occur more than # times
.TP
.BI \-m \0#
size of a mer (required)
.TP
.BI \-c \0#
homopolymer compression (optional)
.TP
.B \-p
enable positions
.TP
.BI \-s \0seq.fasta
sequence to build the table for
.TP
.BI \-o \0tblprefix
output table prefix
.TP
\fB\-v\fR
be verbose (entertain the user with progress output)
.PP
The
.B meryl
process can run in one large memory batch, in many small memory batches, or under SGE control, all with or without using multiple CPU cores.
By default, the computation is done as one large sequential process.
Multi-threaded operation is possible, at additional memory expense, as
is segmented operation, at additional I/O expense.
.TP
.B Threaded operation
Split the counting into \fIn\fR almost equally sized pieces.
This uses an extra \fIh\fR MB (as reported by \fB\-P\fR) per thread.
.RS
.TP
.BI \-threads \0n
use
.I n
threads to build
.RE
.TP
.B Segmented, sequential operation
Split the counting into pieces that
will fit into no more than \fIm\fR MB of memory, or into \fIn\fR equal-sized pieces.
Each piece is computed sequentially, and the results are merged at the end.
Only one of \fB\-memory\fR and \fB\-segments\fR is needed.
.RS
.TP
.BI \-memory \0m
use at most \fIm\fR MB of memory per segment
.TP
.BI \-segments \0n
use \fIn\fR segments
.PP
.RE
.TP
.B Segmented, batched operation
Same as sequential, except this allows each segment to be manually executed
in parallel.
Only one of \fB\-memory\fR and \fB\-segments\fR is needed.
Also see the
.I EXAMPLE
section on this page.
.RS
.TP
.BI \-memory \0m
use at most \fIm\fR MB of memory per segment
.TP
.BI \-segments \0n
use \fIn\fR segments
.TP
.B \-configbatch
create the batches
.TP
.BI \-countbatch \0n
run batch number \fIn\fR
.TP
.B \-mergebatch
merge the batches
.PP
Batched mode can run on the grid.
.TP
.BI \-sge \0jobname
unique job name for this execution.
\fBMeryl\fR will submit jobs named mp\fIjobname\fR, nc\fIjobname\fR and nm\fIjobname\fR for
the prepare, count and merge phases, respectively.
.TP
.BI \-sgebuild \0"options"
.TP
.BI \-sgemerge \0"options"
any additional options to
.BR qsub (1)
(e.g., "\-p \-153 \-pe thread 2 \-A merylaccount").
.br
N.B. \- \fB\-N\fR will be ignored.
.br
N.B. \- be sure to quote the options.
.RE
.RE
.TP
.B \-M
Given a list of tables, perform a math, logical or threshold operation.
Unless otherwise specified, all operations take any number of databases.
Math operations are:
.RS
.TP
.B min
count is the minimum count for all databases.  If the mer
does NOT exist in all databases, the mer has a zero count, and
is NOT in the output.
.TP
.B minexist
count is the minimum count for all databases that contain the mer
.TP
.B max
count is the maximum count for all databases
.TP
.B add
count is the sum of the counts for all databases
.TP
.B sub
count is the first minus the second (binary only)
.TP
.B abs
count is the absolute value of the first minus the second (binary only)
.PP
Logical operations are:
.TP
.B and
outputs mer iff it exists in all databases
.TP
.B nand
outputs mer iff it exists in at least one, but not all, databases
.TP
.B or
outputs mer iff it exists in at least one database
.TP
.B xor
outputs mer iff it exists in an odd number of databases
.PP
Threshold operations are:
.TP
.BI lessthan \0x
outputs mer iff it has count <  x
.TP
.BI lessthanorequal \0x
outputs mer iff it has count <= x
.TP
.BI greaterthan \0x
outputs mer iff it has count >  x
.TP
.BI greaterthanorequal \0x
outputs mer iff it has count >= x
.TP
.BI equal \0x
outputs mer iff it has count == x
.PP
Threshold operations work on exactly one database.
.TP
.BI \-s \0tblprefix
use \fItblprefix\fR as a database
.TP
.BI \-o \0tblprefix
create this output
.TP
\fB\-v\fR
be verbose (entertain the user with progress output)
.RE
.TP
.B \-D
Dump table (not all of these work)
.RS
.TP
\fB\-Dd\fR
Dump a histogram of the distance between the same mers.
.TP
\fB\-Dt\fR
Dump mers with a count >= a threshold.  Use \fB\-n\fR to specify the threshold.
.TP
\fB\-Dc\fR
Count the number of mers, distinct mers and unique mers.
.TP
\fB\-Dh\fR
Dump (to stdout) a histogram of mer counts.
.TP
\fB\-s\fR
Read the count table with this prefix (omit the .mcdat or .mcidx extension).
.RE
.SH EXAMPLE
.SS Batch creation of a table
Initialize the computation with \fB\-configbatch\fR, which needs all the build options.
Execute all \fB\-countbatch\fR jobs, then \fB\-mergebatch\fR to complete.
.PP
.nf
.RS
meryl \fB\-configbatch\fR \fB\-B\fR [options] \fB\-o\fR file
meryl \fB\-countbatch\fR 0 \fB\-o\fR file
meryl \fB\-countbatch\fR 1 \fB\-o\fR file
\&...
meryl \fB\-countbatch\fR N \fB\-o\fR file
meryl \fB\-mergebatch\fR N \fB\-o\fR file
.RE
.fi
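.SS Counting canonical kmers and inspecting the result
A minimal sketch of a typical session, using only options documented on this page.
The kmer size (22), the dump threshold (10) and the names \fIreads.fasta\fR and \fIreads\fR
are illustrative assumptions, not defaults or recommendations.
.PP
.nf
.RS
meryl \fB\-P\fR \fB\-m\fR 22 \fB\-s\fR reads.fasta
meryl \fB\-B\fR \fB\-C\fR \fB\-m\fR 22 \fB\-s\fR reads.fasta \fB\-o\fR reads
meryl \fB\-Dh\fR \fB\-s\fR reads
meryl \fB\-Dt\fR \fB\-n\fR 10 \fB\-s\fR reads
.RE
.fi
.PP
The first command estimates memory requirements, the second builds a database with prefix \fIreads\fR,
and the last two dump a histogram of counts and the mers at or above the threshold.
.SS Operating on existing tables
A similar sketch for \fB\-M\fR.
The table prefixes \fIsampleA\fR, \fIsampleB\fR, \fIshared\fR and \fIfrequent\fR are hypothetical,
and placing the threshold value directly after the operation name is an assumption based on the option listing above.
.PP
.nf
.RS
meryl \fB\-M\fR and \fB\-s\fR sampleA \fB\-s\fR sampleB \fB\-o\fR shared
meryl \fB\-M\fR greaterthan 5 \fB\-s\fR sampleA \fB\-o\fR frequent
.RE
.fi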
.SH SEE ALSO
.BR simple (1),
.BR mapMers (1),
.BR mapMers-depth (1),
.BR kmer-mask (1)