.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.43.3.
.TH PSI-CD-HIT-2D.PL "1" "August 2013" "psi-cd-hit-2d.pl 4.6.1-2012-08-27" "User Commands"
.SH NAME
psi\-cd\-hit\-2d.pl \- runs an algorithm similar to CD\-HIT, but uses BLAST to calculate similarities between two databases (db1 and db2)
.SH DESCRIPTION
Usage: psi\-cd\-hit\-2d.pl [Options]
.PP
Options
.TP
\fB\-i\fR
in_dbname, required
.TP
\fB\-o\fR
out_dbname, required
.TP
\fB\-c\fR
clustering threshold (sequence identity), default 0.3
.HP
\fB\-ce\fR clustering threshold (blast expect), default \-1
.IP
By default the expect threshold is not used.
With a positive value, the program clusters sequences if their similarity
meets either the identity threshold or the expect threshold.
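.IP
The interplay of \-c and \-ce can be sketched in Python as follows.
This is an illustrative sketch only, not the program's own code: the function
name, its argument names, and the inclusive comparisons are assumptions.
.nf
```python
def meets_threshold(identity, expect, c=0.3, ce=-1.0):
    """Return True if a hit passes the identity OR the expect threshold.

    identity: sequence identity of the hit as a fraction (0.0 to 1.0)
    expect:   BLAST expect value of the hit
    c, ce:    values of the -c and -ce options (defaults as documented)
    """
    # The identity test is always active (inclusive comparison is assumed).
    if identity >= c:
        return True
    # The expect test is only active when -ce is positive.
    if ce > 0 and expect <= ce:
        return True
    return False
```
.fi
.IP
With the default \-ce of \-1, only the identity test can succeed.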
.TP
\fB\-L\fR
coverage of shorter sequence (aligned / full), default 0.0
.TP
\fB\-M\fR
coverage of longer sequence (aligned / full), default 0.0
.TP
\fB\-R\fR
(1/0) use psi\-blast profile? default 0.
If 1, perform a psi\-blast / pdb\-blast type search.
.TP
\fB\-G\fR
(1/0) use global identity? default 1.
Sequence identity is calculated as
.IP
total identical residues of local alignments /
length of shorter sequence
.IP
If you prefer to use \fB\-G\fR 0, it is suggested that you also
use \fB\-L\fR, such as \fB\-L\fR 0.8, to prevent very short matches.
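.IP
The identity formula above, and the coverage ratio used by \-L, can be
sketched in Python. This is a minimal sketch under stated assumptions: the
function names are hypothetical, and the local alignments are assumed not to
overlap, so their counts can simply be summed.
.nf
```python
def global_identity(identical_per_alignment, len_a, len_b):
    """Total identical residues of local alignments / length of shorter seq."""
    shorter = min(len_a, len_b)
    return sum(identical_per_alignment) / shorter


def shorter_coverage(aligned_per_alignment, len_a, len_b):
    """Aligned residues / full length of the shorter sequence (cf. -L)."""
    shorter = min(len_a, len_b)
    return sum(aligned_per_alignment) / shorter
```
.fi
.IP
For example, two local alignments with 40 and 20 identical residues between
sequences of length 100 and 150 give an identity of 60 / 100.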
.TP
\fB\-d\fR
length of description line in the .clstr file, default 30.
If set to 0, the description is the FASTA defline up to the first space.
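.IP
The effect of \-d can be sketched in Python. This is an illustration only:
the function name is hypothetical, and truncating to the first d characters
for a non\-zero d is an assumption about how the length limit is applied.
.nf
```python
def cluster_description(defline, d=30):
    """Build the description shown in the .clstr file from a FASTA defline."""
    # Drop the leading ">" marker of the FASTA defline, if present.
    text = defline[1:] if defline.startswith(">") else defline
    if d == 0:
        # -d 0: take the defline and stop at the first space.
        return text.split(" ", 1)[0]
    # Assumption: otherwise keep at most the first d characters.
    return text[:d]
```
.fi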
.TP
\fB\-l\fR
length_of_throw_away_sequences, default 10
.TP
\fB\-p\fR
profile search parameters, default
.IP
"\-a 2 \-d nr80 \-j 3 \-F F \-e 0.001 \-b 500 \-v 500"
.HP
\fB\-bfdb\fR profile database, default nr80
.TP
\fB\-s\fR
blast search parameters, default
.IP
"\-F F \-e 0.000001 \-b 100000 \-v 100000"
.HP
\fB\-be\fR blast expect cutoff, default 0.000001
.TP
\fB\-b\fR
name of a file containing a list of hosts.
To run this program in parallel with ssh calls, you need to provide
a list of hosts.
.HP
\fB\-pbs\fR number of jobs to submit each time through the PBS queuing system
.IP
You cannot use both ssh and pbs at the same time.
.HP
\fB\-k\fR (1/0) keep blast raw output file, default 1
.HP
\fB\-rs\fR interval, in sequences, between saves of the restart file and clustering output, default 5000
.IP
Each time 5000 sequences have been processed, the program writes a
restart file and the current clustering information.
.HP
\fB\-restart\fR restart file; reads in a restart file
.IP
If the program crashes, is stopped, or is terminated, you can restart it by
adding the option "\-restart sth.restart".
.HP
\fB\-rf\fR interval, in sequences, between reformats of the blast database, default 200,000
.IP
Once the program has clustered 200,000 sequences, it removes them from the
sequence pool and reformats the blast db to save time.
.HP
\fB\-local\fR directory of local blast db
.IP
When running in parallel with ssh (not pbs), the program can copy blast dbs
to local drives on each node to save blast db reading time.
HOWEVER, IT MAY NOT BE FASTER.
.TP
\fB\-J\fR
job, job_file: execute specific jobs, such as parsing blast output only.
DON'T use it; it is only used by this program itself.
.HP
\fB\-single\fR file of ids of sequences that you know are singletons
.IP
The program will not run them as queries.
.HP
\fB\-i2\fR second input database
.HP
\fB\-blastn\fR run blastn, default 0
.HP
\fB\-lo\fR how much longer a sequence in db2 can be than the sequence in db1 in a cluster, default 0
.IP
The default means that a sequence in db2 should be no longer than the
sequences in db1 in the same cluster.
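.IP
One way to read the \-lo constraint is sketched below in Python. This is a
guess at the intended check, not the program's own code: the function name is
hypothetical, and treating \-lo as an additive allowance in residues is an
assumption.
.nf
```python
def length_allowed(len_db2, len_db1, lo=0):
    """Return True if a db2 sequence may join a cluster led by a db1 sequence.

    Assumption: -lo is the number of residues by which the db2 sequence
    may exceed the db1 sequence; with lo=0 it must not be longer.
    """
    return len_db2 <= len_db1 + lo
```
.fi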
.IP
==============================
by Weizhong Li, liwz@sdsc.edu
==============================
.IP
If you find cd\-hit useful, please kindly cite:
.IP
"Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam Godzik. Bioinformatics, (2001) 17:282\-283
.IP
"Cd\-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658\-1659