UNIKMER(1) User Commands UNIKMER(1)

NAME

unikmer - Toolkit for nucleic acid k-mer analysis

DESCRIPTION

unikmer - Toolkit for k-mers with taxonomic information

unikmer is a toolkit for nucleic acid k-mer analysis. It provides functions including set operations on k-mers, optionally with TaxIds but without count information.

K-mers are either encoded (k<=32) or hashed (arbitrary k) into 'uint64', and serialized in binary files with the extension '.unik'.
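The encoding for k<=32 works because each base needs only 2 bits, so 32 bases fit in 64 bits. The sketch below illustrates the idea. It is not taken from the unikmer source: the function names are hypothetical, and the bit assignment A=0, C=1, G=2, T=3 is an assumption made for illustration.

```python
# Minimal sketch of 2-bit k-mer encoding; names and bit layout are assumptions.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed mapping
BITS_TO_BASE = "ACGT"


def encode_kmer(kmer: str) -> int:
    """Pack a k-mer (1 <= k <= 32) into an integer that fits in 64 bits."""
    k = len(kmer)
    if not 1 <= k <= 32:
        raise ValueError("k must be in [1, 32] to fit into 64 bits")
    code = 0
    for base in kmer.upper():
        if base not in BASE_TO_BITS:
            raise ValueError(f"unsupported base: {base!r}")
        # Shift previous bases left by 2 bits and append the new base.
        code = (code << 2) | BASE_TO_BITS[base]
    return code


def decode_kmer(code: int, k: int) -> str:
    """Recover the k-mer text from its integer code; k must be supplied."""
    bases = []
    for _ in range(k):
        bases.append(BITS_TO_BASE[code & 3])  # lowest 2 bits hold the last base
        code >>= 2
    return "".join(reversed(bases))
```

Note that k has to be known when decoding, because leading 'A' bases are encoded as zero bits and are indistinguishable from padding.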

TaxIds can be assigned when counting k-mers from genome sequences, and the LCA (Lowest Common Ancestor) is computed during set operations, including union, intersection, set difference, and finding unique and repeated k-mers.
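To illustrate how an LCA can be combined with a set operation, here is a minimal sketch of a union over two k-mer-to-TaxId maps. It is not unikmer's implementation: the toy taxonomy, the dictionary representation, and all function names are hypothetical. The only convention borrowed from NCBI Taxonomy is that the root node (TaxId 1) is its own parent.

```python
# Toy taxonomy (hypothetical): child TaxId -> parent TaxId; root 1 points to itself.
PARENT = {1: 1, 2: 1, 3: 2, 4: 2, 5: 1}


def lineage(taxid, parent):
    """Return the path from taxid up to the root, inclusive."""
    path = [taxid]
    while parent[taxid] != taxid:
        taxid = parent[taxid]
        path.append(taxid)
    return path


def lca(a, b, parent):
    """Lowest common ancestor: first node on b's lineage that is also above a."""
    ancestors = set(lineage(a, parent))
    for node in lineage(b, parent):
        if node in ancestors:
            return node
    raise ValueError("TaxIds do not share a root")


def union_with_lca(x, y, parent):
    """Union of two {kmer_code: taxid} maps; shared k-mers get the LCA."""
    merged = dict(x)
    for kmer, taxid in y.items():
        if kmer in merged:
            merged[kmer] = lca(merged[kmer], taxid, parent)
        else:
            merged[kmer] = taxid
    return merged
```

In this toy example, a k-mer seen under TaxIds 3 and 4 ends up assigned to their common parent 2, while k-mers present in only one input keep their original TaxId.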

Version: v0.19.0

Author: Wei Shen <shenwei356@gmail.com>

Documents  : https://bioinf.shenwei.me/unikmer
Source code: https://github.com/shenwei356/unikmer

Dataset (optional):

Manipulating k-mers with TaxIds requires taxonomy files, e.g., from the NCBI Taxonomy database. Please extract "nodes.dmp", "names.dmp", "delnodes.dmp" and "merged.dmp" from ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz into ~/.unikmer/ or some other directory, which you can later refer to with the flag --data-dir or the environment variable UNIKMER_DB.
For GTDB, use 'taxonkit create-taxdump' to create NCBI-style taxonomy dump files, or download from:
https://github.com/shenwei356/gtdb-taxonomy
Note that TaxIds are represented using uint32 and stored in 4 or fewer bytes; all TaxIds should be in the range [1, 4294967295].
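As a rough illustration of why smaller TaxIds can be stored in fewer than 4 bytes, the sketch below computes the byte width needed for a given maximum TaxId and packs a TaxId into that width. The function names are hypothetical, and the big-endian byte order is an assumption; the actual on-disk layout of '.unik' files is not described here.

```python
# Sketch only: byte order and function names are assumptions, not the .unik format.
MAX_UINT32 = 4294967295  # 1<<32 - 1, the upper bound stated above


def bytes_needed(max_taxid: int) -> int:
    """Smallest number of whole bytes (1 to 4) that can hold max_taxid."""
    if not 1 <= max_taxid <= MAX_UINT32:
        raise ValueError("TaxIds must be in the range [1, 4294967295]")
    return (max_taxid.bit_length() + 7) // 8


def pack_taxid(taxid: int, width: int) -> bytes:
    """Serialize a TaxId into exactly `width` bytes (big-endian, assumed)."""
    return taxid.to_bytes(width, "big")


def unpack_taxid(data: bytes) -> int:
    """Inverse of pack_taxid."""
    return int.from_bytes(data, "big")
```

For example, if no TaxId in a data set exceeds 65535, two bytes per TaxId are enough under this scheme, half of what a full uint32 would take.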

Usage:

unikmer [command]

Available Commands:

autocompletion  Generate shell autocompletion script (bash|zsh|fish|powershell)
common          Find k-mers shared by most of multiple binary files
concat          Concatenate multiple binary files without removing duplicates
count           Generate k-mers (sketch) from FASTA/Q sequences
decode          Decode encoded integer to k-mer text
diff            Set difference of multiple binary files
dump            Convert plain k-mer text to binary format
encode          Encode plain k-mer text to integer
filter          Filter out low-complexity k-mers (experimental)
grep            Search k-mers from binary files
head            Extract the first N k-mers
info            Information of binary files
inter           Intersection of multiple binary files
locate          Locate k-mers in genome
merge           Merge k-mers from sorted chunk files
num             Quickly inspect number of k-mers in binary files
rfilter         Filter k-mers by taxonomic rank
sample          Sample k-mers from binary files
sort            Sort k-mers in binary files to reduce file size
split           Split k-mers into sorted chunk files
tsplit          Split k-mers according to taxid
union           Union of multiple binary files
uniqs           Mapping k-mers back to genome and find unique subsequences
version         Print version information and check for update
view            Read and output binary format to plain text

Flags:

write compact binary file with little loss of speed
compression level (default -1)
directory containing NCBI Taxonomy files, including nodes.dmp, names.dmp, merged.dmp and delnodes.dmp (default "/home/nilesh/.unikmer")
help for unikmer
ignore taxonomy information
file of input files list (one file per line), if given, they are appended to files from cli arguments
for smaller TaxIds, less space can be used to store them. The default value is 1<<32-1, which is enough for NCBI Taxonomy TaxIds (default 4294967295)
do not compress binary file (not recommended)
do not check binary file, when using process substitution or named pipe
number of CPUs to use (default 4)
print verbose information

Use "unikmer [command] --help" for more information about a command.

August 2022 unikmer 0.19.0