
CD-HIT-PARA.PL(1) User Commands CD-HIT-PARA.PL(1)

NAME

cd-hit-para.pl - divide a big clustering job into pieces to run cd-hit or cd-hit-est jobs

SYNOPSIS

cd-hit-para.pl options

DESCRIPTION

This script divides a big clustering job into pieces and submits them to remote computers over a network so that they run in parallel. After all the jobs have finished, the script merges the clustering results, as if you had run a single cd-hit or cd-hit-est job.
You can also use it to divide a big job on a single computer that does not have enough RAM (with the --L option).

Requirements:

1 When running this script over a network, the directory where you run the script and the input files must be available on all the remote hosts under an identical path.
2 If you choose "ssh" to submit jobs, you must have passwordless ssh access to every remote host; see the ssh manual for how to set up passwordless ssh.
3 I suggest using a queuing system instead of ssh; PBS and SGE are currently supported.
4 cd-hit, cd-hit-2d, cd-hit-est, cd-hit-est-2d, cd-hit-div and cd-hit-div.pl must be in the same directory as this script.
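
Requirement 4 can be checked before starting a long run. The following is a minimal sketch, not part of cd-hit: the function name missing_cdhit_programs is hypothetical, and its argument is assumed to be the directory that contains cd-hit-para.pl.

```shell
# Print the name of every required program that is missing from (or not
# executable in) the given directory. Prints nothing when all are present.
# The function name and its argument are illustrative assumptions.
missing_cdhit_programs() {
    dir=$1
    for prog in cd-hit cd-hit-2d cd-hit-est cd-hit-est-2d cd-hit-div cd-hit-div.pl; do
        # Each program must exist as a regular file and be executable.
        if [ ! -f "$dir/$prog" ] || [ ! -x "$dir/$prog" ]; then
            echo "$prog"
        fi
    done
}
```

Calling the function with the script's directory as its only argument lists the programs that still need to be copied or built there; empty output means the requirement is met.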

Options

-i input filename in fasta format, required

-o output filename, required

--P program, "cd-hit" or "cd-hit-est", default "cd-hit"

--B filename of list of hosts,

required unless the --Q or --L option is supplied

--L number of cpus on local computer, default 0

when you are not running it over a cluster, you can use this option to divide a big clustering job into small pieces; I suggest you just use "--L 1" unless you have enough RAM for each cpu

--S Number of segments to split input DB into, default 64

--Q number of jobs to submit to queuing system, default 0

by default, the program uses ssh mode to submit remote jobs

--T type of queuing system, "PBS", "SGE" are supported, default PBS

--R restart file, used to resume after a crashed run

-h print this help

More cd-hit/cd-hit-est options can be specified on the command line
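
The options above combine into three typical invocations: local, ssh and queue mode. The sketch below only prints example command lines rather than running them. All file names (proteins.fasta, proteins_nr90, hosts.txt) and the job count 16 are hypothetical placeholders. The trailing -c and -n are cd-hit's own identity-threshold and word-length options, shown here at their cd-hit defaults as an example of passing extra options through.

```shell
# Print (do not run) example cd-hit-para.pl command lines, one per mode.
# File names and the job count are placeholders, not defaults of the script.
print_examples() {
    in=proteins.fasta
    out=proteins_nr90

    # Single computer with limited RAM: one local CPU, input split into 64 segments.
    echo "cd-hit-para.pl -i $in -o $out --L 1 --S 64"

    # ssh mode (the default): hosts.txt stands for the host list file given to --B.
    echo "cd-hit-para.pl -i $in -o $out --B hosts.txt --S 64"

    # Queue mode: submit 16 jobs to an SGE queuing system.
    echo "cd-hit-para.pl -i $in -o $out --Q 16 --T SGE --S 64"

    # Extra cd-hit options are appended after the script's own options.
    echo "cd-hit-para.pl -i $in -o $out --L 1 -c 0.9 -n 5"
}
print_examples
```

To run one of these for real, copy the printed line and substitute your own input and output names.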

Questions, bugs, contact Weizhong Li at liwz@sdsc.edu
September 2018 cd-hit-para.pl 4.6.8