.\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.43.3.
.TH PSI-CD-HIT-2D.PL "1" "August 2013" "psi-cd-hit-2d.pl 4.6.1-2012-08-27" "User Commands"
.SH NAME
psi-cd-hit-2d.pl \- runs an algorithm similar to CD-HIT, but uses BLAST to calculate similarities, in db1 or db2 format
.SH DESCRIPTION
Usage psi\-cd\-hit\-2d [Options]
.PP
Options
.TP
\fB\-i\fR
in_dbname, required
.TP
\fB\-o\fR
out_dbname, required
.TP
\fB\-c\fR
clustering threshold (sequence identity), default 0.3
.HP
\fB\-ce\fR clustering threshold (blast expect), default \-1,
.IP
which means that by default the expect threshold is not used;
with a positive value, the program clusters seqs if their similarities
meet either the identity threshold or the expect threshold
.TP
\fB\-L\fR
coverage of shorter sequence (aligned / full), default 0.0
.TP
\fB\-M\fR
coverage of longer sequence (aligned / full), default 0.0
.TP
\fB\-R\fR
(1/0) use psi\-blast profile? default 0
perform psi\-blast / pdb\-blast type search
.TP
\fB\-G\fR
(1/0) use global identity? default 1
sequence identity calculated as
.IP
total identical residues of local alignments /
length of shorter seq
.IP
if you prefer to use \fB\-G\fR 0, it is suggested that you also
use \fB\-L\fR, such as \fB\-L\fR 0.8, to prevent very short matches.
.TP
\fB\-d\fR
length of description line in the .clstr file, default 30
if set to 0, it takes the fasta defline and stops at the first space
.TP
\fB\-l\fR
length_of_throw_away_sequences, default 10
.TP
\fB\-p\fR
profile search para, default
.IP
"\-a 2 \-d nr80 \-j 3 \-F F \-e 0.001 \-b 500 \-v 500"
.HP
\fB\-bfdb\fR profile database, default nr80
.TP
\fB\-s\fR
blast search para, default
.IP
"\-F F \-e 0.000001 \-b 100000 \-v 100000"
.HP
\fB\-be\fR blast expect cutoff, default 0.000001
.TP
\fB\-b\fR
filename of list of hosts
to run this program in parallel with ssh calls, you need to provide
a list of hosts
.HP
\fB\-pbs\fR number of jobs to send each time by the PBS queuing system
.IP
you cannot use both ssh and pbs at the same time
.HP
\fB\-k\fR (1/0) keep blast raw output file, default 1
.HP
\fB\-rs\fR steps between saves of the restart file and clustering output, default 5000
.IP
every time 5000 sequences have been processed, the program writes a restart file
and the current clustering information
.HP
\fB\-restart\fR restart file, read in a restart file
.IP
if the program crashed, stopped, or was terminated, you can restart it by
adding the option "\-restart sth.restart"
.HP
\fB\-rf\fR steps between reformats of the blast database, default 200,000
.IP
once the program has clustered 200,000 seqs, it removes them from the seq
pool and reformats the blast db to save time
.HP
\fB\-local\fR dir of local blast db,
.IP
when running in parallel with ssh (not pbs), I can copy blast dbs
to local drives on each node to save blast db reading time,
BUT IT MAY NOT BE FASTER
.TP
\fB\-J\fR
job, job_file, execute specific jobs such as parse blast output only
DON'T use it; it is only used by this program itself
.HP
\fB\-single\fR files of ids that you know are singletons,
.IP
so I won't run them as queries
.HP
\fB\-i2\fR second input database
.HP
\fB\-blastn\fR run blastn, default 0
.HP
\fB\-lo\fR how much longer a seq in db2 can be than the seqs in db1 in a cluster, default 0
.IP
0 means that a seq in db2 should be no longer than (<=) the seqs in db1 in a cluster
.IP
==============================
.br
by Weizhong Li, liwz@sdsc.edu
.br
==============================
.IP
If you find cd\-hit useful, please kindly cite:
.IP
"Clustering of highly homologous sequences to reduce the size of large protein databases", Weizhong Li, Lukasz Jaroszewski & Adam Godzik, Bioinformatics, (2001) 17:282\-283
.IP
"Cd\-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik, Bioinformatics, (2006) 22:1658\-1659