.\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.47.15.
.TH COLLAPSESEQ.PY "1" "May 2020" "CollapseSeq.py 0.6.0" "User Commands"
.SH NAME
CollapseSeq.py \- removes duplicate sequences from FASTA/FASTQ files
.SH DESCRIPTION
usage: CollapseSeq.py [\-\-version] [\-h] \fB\-s\fR SEQ_FILES [SEQ_FILES ...]
.TP
[\-o OUT_FILES [OUT_FILES ...]] [\-\-outdir OUT_DIR]
[\-\-outname OUT_NAME] [\-\-log LOG_FILE] [\-\-failed]
[\-\-fasta] [\-\-delim DELIMITER DELIMITER DELIMITER]
[\-n MAX_MISSING] [\-\-uf UNIQ_FIELDS [UNIQ_FIELDS ...]]
[\-\-cf COPY_FIELDS [COPY_FIELDS ...]]
[\-\-act {min,max,sum,set} [{min,max,sum,set} ...]]
[\-\-inner] [\-\-keepmiss]
[\-\-maxf MAX_FIELD | \fB\-\-minf\fR MIN_FIELD]
.PP
Removes duplicate sequences from FASTA/FASTQ files
.SS "help:"
.TP
\fB\-\-version\fR
show program's version number and exit
.TP
\fB\-h\fR, \fB\-\-help\fR
show this help message and exit
.SS "standard arguments:"
.TP
\fB\-s\fR SEQ_FILES [SEQ_FILES ...]
A list of FASTA/FASTQ files containing sequences to
process. (default: None)
.TP
\fB\-o\fR OUT_FILES [OUT_FILES ...]
Explicit output file name(s). Note that this argument
cannot be used with the \fB\-\-failed\fR, \fB\-\-outdir\fR, or
\fB\-\-outname\fR arguments. If unspecified, the output
filename will be based on the input filename(s).
(default: None)
.TP
\fB\-\-outdir\fR OUT_DIR
Changes the output directory to the specified
location. The input file directory is used if this is
not specified. (default: None)
.TP
\fB\-\-outname\fR OUT_NAME
Changes the prefix of the successfully processed
output file to the string specified. May not be
specified with multiple input files. (default: None)
.TP
\fB\-\-log\fR LOG_FILE
Specify to write verbose logging to a file. May not be
specified with multiple input files. (default: None)
.TP
\fB\-\-failed\fR
If specified, create files containing records that fail
processing. (default: False)
.TP
\fB\-\-fasta\fR
Specify to force output as FASTA rather than FASTQ.
(default: None)
.TP
\fB\-\-delim\fR DELIMITER DELIMITER DELIMITER
A list of the three delimiters that separate
annotation blocks, field names and values, and values
within a field, respectively. (default: ('|', '=', \&','))
.SS "collapse arguments:"
.TP
\fB\-n\fR MAX_MISSING
Maximum number of missing nucleotides to consider for
collapsing sequences. A sequence will be considered
undetermined if it contains too many missing
nucleotides. (default: 0)
.TP
\fB\-\-uf\fR UNIQ_FIELDS [UNIQ_FIELDS ...]
Specifies a set of annotation fields that must match
for sequences to be considered duplicates. (default:
None)
.TP
\fB\-\-cf\fR COPY_FIELDS [COPY_FIELDS ...]
Specifies a set of annotation fields to copy into the
unique sequence output. (default: None)
.TP
\fB\-\-act\fR {min,max,sum,set} [{min,max,sum,set} ...]
List of actions to take for each copy field, defining
how each annotation will be combined into a single
value. The actions "min", "max", and "sum" perform the
corresponding mathematical operation on numeric
annotations. The action "set" collapses annotations
into a comma\-delimited list of unique values.
(default: None)
.TP
\fB\-\-inner\fR
If specified, exclude consecutive missing characters
at either end of the sequence. (default: False)
.TP
\fB\-\-keepmiss\fR
If specified, sequences with more missing characters
than the threshold set by the \fB\-n\fR parameter will be
written to the unique sequence output file with a
DUPCOUNT=1 annotation. If not specified, such
sequences will be written to a separate file.
(default: False)
.TP
\fB\-\-maxf\fR MAX_FIELD
Specify the field whose maximum value determines the
retained sequence; mutually exclusive with \fB\-\-minf\fR.
(default: None)
.TP
\fB\-\-minf\fR MIN_FIELD
Specify the field whose minimum value determines the
retained sequence; mutually exclusive with \fB\-\-maxf\fR.
(default: None)
.SS "output files:"
.IP
collapse\-unique
.IP
unique sequences. Contains one representative from each set of
duplicate sequences.
The retained representative is determined by user\-defined criteria.
.IP
collapse\-duplicate
.IP
raw reads that are duplicates of the sequences retained
in the collapse\-unique file.
.IP
collapse\-undetermined
.IP
raw reads that were excluded from consideration because
the sequence contains too many N characters.
.SS "output annotation fields:"
.IP
DUPCOUNT
.IP
total number of sequences within the set of duplicates for each
retained unique sequence; that is, the copy number of each unique
sequence within the data file.
.IP
.IP
annotation fields specified by the \fB\-\-cf\fR parameter.
.SH AUTHOR
This manpage was written by Andreas Tille for the Debian distribution and
can be used for any other usage of the program.
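.SH EXAMPLES
The DUPCOUNT annotation and the \fB\-\-cf\fR and \fB\-\-act\fR options described
above can be illustrated with a minimal Python sketch. It is not the
program's implementation. The helper names (combine, collapse), the
annotation names CONSCOUNT and PRIMER, and the first\-seen ordering of
"set" values are assumptions made for illustration. The sketch groups
exactly identical sequences only and ignores \fB\-n\fR, \fB\-\-uf\fR, \fB\-\-inner\fR,
\fB\-\-maxf\fR, and \fB\-\-minf\fR.
.PP
.nf
```python
def combine(values, action):
    """Combine one copy field's values across a set of duplicates."""
    if action == "set":
        # Unique values in first-seen order, comma delimited (assumed order).
        return ",".join(dict.fromkeys(values))
    numbers = [float(v) for v in values]
    result = {"min": min, "max": max, "sum": sum}[action](numbers)
    # Write whole numbers without a trailing ".0".
    return str(int(result)) if result.is_integer() else str(result)


def collapse(records, copy_fields=(), actions=()):
    """Group identical sequences; annotate each group with DUPCOUNT."""
    groups = {}
    for seq, annotations in records:
        groups.setdefault(seq, []).append(annotations)
    unique = []
    for seq, members in groups.items():
        out = {"DUPCOUNT": str(len(members))}
        # One action per copy field, paired by position.
        for field, action in zip(copy_fields, actions):
            out[field] = combine([m[field] for m in members], action)
        unique.append((seq, out))
    return unique


# Hypothetical records: (sequence, annotation dictionary).
records = [
    ("ACGT", {"CONSCOUNT": "3", "PRIMER": "IGHM"}),
    ("ACGT", {"CONSCOUNT": "5", "PRIMER": "IGHG"}),
    ("TTGA", {"CONSCOUNT": "2", "PRIMER": "IGHM"}),
]
unique = collapse(records, ["CONSCOUNT", "PRIMER"], ["sum", "set"])
for seq, annotations in unique:
    print(seq, annotations)
```
.fi
.PP
In this sketch the two ACGT records collapse into a single entry with
DUPCOUNT=2, CONSCOUNT=8 (the "sum" action), and PRIMER=IGHM,IGHG (the
"set" action), while TTGA is kept with DUPCOUNT=1.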