
TRIM_GALORE(1) User Commands TRIM_GALORE(1)

NAME

trim_galore - automate quality and adapter trimming for DNA sequencing

DESCRIPTION

USAGE:

trim_galore [options] <filename(s)>

-h/--help Print this help message and exit.

-v/--version Print the version information and exit.

-q/--quality <INT> Trim low-quality ends from reads in addition to adapter removal. For

trimming is carried out in a second round. Other files are quality and adapter trimmed in a single pass. The algorithm is the same as the one used by BWA (subtract INT from all qualities; compute partial sums from all indices to the end of the sequence; cut the sequence at the index at which the sum is minimal). Default Phred score: 20.
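
As a rough illustration of the partial-sum rule described above, the following shell sketch takes a cutoff and a list of numeric Phred scores and prints how many bases would be kept. The function name quality_trim_length is hypothetical, and the code follows the description literally; it is not taken from Trim Galore, Cutadapt or BWA, which may handle edge cases differently.

```sh
# Hypothetical helper, not part of Trim Galore: apply the partial-sum rule
# to a list of numeric Phred scores and print the number of bases to keep.
quality_trim_length() {
  cutoff=$1
  shift
  printf '%s\n' "$*" | awk -v cutoff="$cutoff" '{
    keep = NF; sum = 0; min = 0
    # Walk from the last base towards the first, summing (quality - cutoff).
    for (i = NF; i >= 1; i--) {
      sum += $i - cutoff
      # A new minimum means that cutting just before base i is the best so far.
      if (sum < min) { min = sum; keep = i - 1 }
    }
    print keep
  }'
}
```

For example, quality_trim_length 20 30 30 30 10 10 prints 3, meaning the two low-quality bases at the end would be removed.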

--phred33 Instructs Cutadapt to use ASCII+33 quality scores as Phred scores

(Sanger/Illumina 1.9+ encoding) for quality trimming. Default: ON.

--phred64 Instructs Cutadapt to use ASCII+64 quality scores as Phred scores

(Illumina 1.5 encoding) for quality trimming.
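
The two encodings differ only in the offset that is subtracted from the ASCII code of each quality character. A minimal sketch of that conversion, where phred_score is a hypothetical helper and not a Trim Galore or Cutadapt command:

```sh
# Hypothetical helper: convert one quality character to its Phred score.
# Usage: phred_score <character> [offset]   (offset defaults to 33)
phred_score() {
  offset=${2:-33}
  # A leading single quote makes printf print the character code.
  ascii=$(printf '%d' "'$1")
  echo $(( ascii - offset ))
}
```

With this sketch, phred_score I prints 40 under the ASCII+33 encoding, and phred_score h 64 prints 40 under the ASCII+64 encoding.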

--fastqc Run FastQC in the default mode on the FastQ file once trimming is complete.

--fastqc_args "<ARGS>" Passes extra arguments to FastQC. If more than one argument is to be passed

--fastqc_args "--nogroup --outdir /home/". Passing extra arguments will automatically invoke FastQC, so --fastqc does not have to be specified separately.

-a/--adapter <STRING> Adapter sequence to be trimmed. If not specified explicitly, Trim Galore will

small RNA adapter sequence was used. Also see '--illumina', '--nextera' and '--small_rna'. If no adapter can be detected within the first 1 million sequences of the first file specified or if there is a tie between several adapter sequences, Trim Galore defaults to '--illumina' (as long as the Illumina adapter was one of the options, else '--nextera' is the default). A single base may also be given as e.g. -a A{10}, to be expanded to -a AAAAAAAAAA.
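
The single-base shorthand mentioned above can be pictured with the following sketch. Trim Galore performs this expansion itself; expand_adapter is a hypothetical helper written only to show what the shorthand stands for, and the set of accepted bases (A, C, G, T, N) is an assumption.

```sh
# Hypothetical helper, not part of Trim Galore: expand a "single base plus
# repeat count" adapter such as A{10} into the literal sequence.
expand_adapter() {
  spec=$1
  case $spec in
    [ACGTN]\{*\})
      base=$(printf '%s\n' "$spec" | cut -c 1)   # the single base, e.g. A
      count=${spec#??}                           # drop the base and the opening brace
      count=${count%?}                           # drop the closing brace
      case $count in
        ''|*[!0-9]*)
          printf '%s\n' "$spec" ;;               # count is not a plain number: pass through
        *)
          printf "%${count}s" '' | tr ' ' "$base"
          printf '\n' ;;
      esac ;;
    *)
      printf '%s\n' "$spec" ;;                   # ordinary adapter sequence: pass through
  esac
}
```

Here expand_adapter 'A{10}' prints AAAAAAAAAA, while a regular sequence such as AGATCGGAAGAGC is printed unchanged.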

-a2/--adapter2 <STRING> Optional adapter sequence to be trimmed off read 2 of paired-end files. This

are smallRNA then a2 will be set to the Illumina small RNA 5' adapter automatically (GATCGTCGGACT). A single base may also be given as e.g. -a2 A{10}, to be expanded to -a2 AAAAAAAAAA.

--illumina Adapter sequence to be trimmed is the first 13bp of the Illumina universal adapter

'AGATCGGAAGAGC' instead of the default auto-detection of adapter sequence.

--nextera Adapter sequence to be trimmed is the first 12bp of the Nextera adapter

'CTGTCTCTTATA' instead of the default auto-detection of adapter sequence.

--small_rna Adapter sequence to be trimmed is the first 12bp of the Illumina Small RNA 3' Adapter

'TGGAATTCTCGG' instead of the default auto-detection of adapter sequence. Selecting to trim smallRNA adapters will also lower the --length value to 18bp. If the smallRNA libraries are paired-end then a2 will be set to the Illumina small RNA 5' adapter automatically (GATCGTCGGACT) unless -a2 has been defined explicitly.

--consider_already_trimmed <INT> During adapter auto-detection, the limit set by <INT> allows the user to

sequence exceeds this threshold, no additional adapter trimming will be performed (technically, the adapter is set to '-a X'). Quality trimming is still performed as usual. Default: NOT SELECTED (i.e. normal auto-detection precedence rules apply).

--max_length <INT> Discard reads that are longer than <INT> bp after trimming. This is only advised for

smallRNA sequencing to remove non-small RNA sequences.

--stringency <INT> Overlap with adapter sequence required to trim a sequence. Defaults to a

will be trimmed off from the 3' end of any read.

-e <ERROR RATE> Maximum allowed error rate (no. of errors divided by the length of the matching

region) (default: 0.1)
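
Because the error rate is defined relative to the length of the matching region, the number of tolerated errors grows with the length of the match. The sketch below assumes the allowed count is the rate times the match length, rounded down; the rounding rule is an assumption that the text above does not state, and max_errors is a hypothetical helper.

```sh
# Hypothetical helper: number of errors tolerated for a match of a given
# length, ASSUMING the product is rounded down to a whole number.
# Usage: max_errors <match_length> [error_rate]   (rate defaults to 0.1)
max_errors() {
  awk -v len="$1" -v rate="${2:-0.1}" 'BEGIN { print int(len * rate) }'
}
```

Under that assumption, a 13 bp match at the default rate tolerates one error (max_errors 13 prints 1), while a 9 bp match tolerates none.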

--gzip Compress the output file with GZIP. If the input files are GZIP-compressed

compression will take place on the fly.

--dont_gzip Output files won't be compressed with GZIP. This option overrides --gzip.

--length <INT> Discard reads that became shorter than length INT because of either

this behaviour. Default: 20 bp.
<INT> bp to be printed out to validated paired-end files (see option --paired). If only one read became too short there is the possibility of keeping such unpaired single-end reads (see --retain_unpaired). Default pair-cutoff: 20 bp.

--max_n COUNT The total number of Ns (as integer) a read may contain before it will be removed altogether.

pair being removed from the trimmed output files.

--trim-n Removes Ns from either side of the read. This option currently does not work in RRBS mode.

-o/--output_dir <DIR> If specified all output will be written to this directory instead of the current

directory. If the directory doesn't exist it will be created for you.

--no_report_file If specified no report file will be generated.

--suppress_warn If specified any output to STDOUT or STDERR will be suppressed.

--clip_R1 <int> Instructs Trim Galore to remove <int> bp from the 5' end of read 1 (or single-end

sort of unwanted bias at the 5' end. Default: OFF.

--clip_R2 <int> Instructs Trim Galore to remove <int> bp from the 5' end of read 2 (paired-end reads

of unwanted bias at the 5' end. For paired-end BS-Seq, it is recommended to remove the first few bp because the end-repair reaction may introduce a bias towards low methylation. Please refer to the M-bias plot section in the Bismark User Guide for some examples. Default: OFF.

--three_prime_clip_R1 <int> Instructs Trim Galore to remove <int> bp from the 3' end of read 1 (or single-end

bias from the 3' end that is not directly related to adapter sequence or basecall quality. Default: OFF.

--three_prime_clip_R2 <int> Instructs Trim Galore to remove <int> bp from the 3' end of read 2 AFTER

the 3' end that is not directly related to adapter sequence or basecall quality. Default: OFF.

--2colour/--nextseq INT This enables the option '--nextseq-trim=3'CUTOFF' within Cutadapt, which will set a quality

This trimming is common to the NextSeq and NovaSeq platforms, where basecalls without any signal are called as high-quality G bases. This is mutually exclusive with '-q INT'.

--path_to_cutadapt </path/to/cutadapt> You may use this option to specify a path to the Cutadapt executable,

the PATH.

--basename <PREFERRED_NAME> Use PREFERRED_NAME as the basename for output files, instead of deriving the filenames from

PREFERRED_NAME_val_1.fq(.gz) and PREFERRED_NAME_val_2.fq(.gz) for paired-end data. --basename only works when 1 file (single-end) or 2 files (paired-end) are specified, but not for longer lists.

-j/--cores INT Number of cores to be used for trimming [default: 1]. For Cutadapt to work with multiple cores, it

is detected from the shebang line of the Cutadapt executable (either 'cutadapt', or a specified path). If Python 2 is detected, --cores is set to 1. If pigz cannot be detected on your system, Trim Galore reverts to using gzip compression. Please note that gzip compression will slow down multi-core processes so much that it is hardly worthwhile; see https://github.com/FelixKrueger/TrimGalore/issues/16#issuecomment-458557103 for more info.
Assuming that Python 3 is used and pigz is installed, --cores 2 would use 2 cores to read the input (probably not at a high usage though), 2 cores to write to the output (at moderately high usage), and 2 cores for Cutadapt itself + 2 additional cores for Cutadapt (not sure what they are used for) + 1 core for Trim Galore itself. So this can be up to 9 cores, even though most of them won't be used at 100% for most of the time. Paired-end processing uses twice as many cores for the validation (= writing out) step. --cores 4 would then be: 4 (read) + 4 (write) + 4 (Cutadapt) + 2 (extra Cutadapt) + 1 (Trim Galore) = 15.
It seems that --cores 4 could be a sweet spot; anything above has diminishing returns.
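
The core counts quoted above follow one pattern: N cores for reading, N for writing, N for Cutadapt, plus 2 extra Cutadapt cores and 1 for Trim Galore itself. The sketch below reproduces only that arithmetic for the single-end case; estimate_cores is a hypothetical helper, and the result is a rough upper bound rather than a measurement.

```sh
# Hypothetical helper: upper bound on cores in use for a given --cores value,
# following the breakdown given above (single-end case only).
estimate_cores() {
  n=$1
  # read + write + Cutadapt + 2 extra Cutadapt cores + 1 for Trim Galore
  echo $(( n + n + n + 2 + 1 ))
}
```

estimate_cores 2 prints 9 and estimate_cores 4 prints 15, matching the figures above.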

SPECIFIC TRIMMING - without adapter/quality trimming

--hardtrim5 <int> Instead of performing adapter-/quality trimming, this option will simply hard-trim sequences

Hard-trimmed output files will end in .<int>_5prime.fq(.gz). Here is an example:
CCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAAT
--hardtrim5 20: CCTAAGGAAACAAGTACACT

--hardtrim3 <int> Instead of performing adapter-/quality trimming, this option will simply hard-trim sequences

Hard-trimmed output files will end in .<int>_3prime.fq(.gz). Here is an example:
CCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAAT
--hardtrim3 20: TTTTTAAGAAAATGGAAAAT
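
The effect of both hard-trimming modes on a single sequence can be mimicked with standard tools. This is only a sketch on a bare sequence string: hardtrim5 and hardtrim3 are hypothetical shell functions named after the options, and a real FastQ record would need its quality line shortened in the same way.

```sh
# Hypothetical helpers mirroring the two options on a bare sequence string.

# Keep the first <n> bases, counted from the 5-prime end.
hardtrim5() {
  printf '%s\n' "$1" | cut -c "1-$2"
}

# Keep the last <n> bases, counted from the 3-prime end.
hardtrim3() {
  printf '%s\n' "$1" | awk -v n="$2" '{
    start = length($0) - n + 1
    if (start < 1) start = 1      # shorter reads are left untouched
    print substr($0, start)
  }'
}

read_seq=CCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAAT
hardtrim5 "$read_seq" 20   # CCTAAGGAAACAAGTACACT
hardtrim3 "$read_seq" 20   # TTTTTAAGAAAATGGAAAAT
```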

--clock In this mode, reads are trimmed in a specific way that is currently used for the Mouse

Genome Biology, 2017 18:68 https://doi.org/10.1186/s13059-017-1203-5). Following this, Trim Galore will exit.
In its current implementation, the dual-UMI RRBS reads come in the following format:
5' UUUUUUUU CAGTA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF TACTG UUUUUUUU 3'
3' UUUUUUUU GTCAT FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF ATGAC UUUUUUUU 5'
and FFFFFFF... is the actual RRBS-Fragment to be sequenced. The UMIs for Read 1 (R1) and Read 2 (R2), as well as the fixed sequences (F1 or F2), are written into the read ID and removed from the actual sequence. Here is an example:
ATCTAGTTCAGTACGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
CAATTTTGCAGTACAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA
CGGTGTTTTCGAATTAGAAAAATATGTATAGAGGAAATAGATATAAAGGCGTATTCGTTATTG
CAAAAATAATACCTCCTCTATTTATCCAAAATCACAAAAAACCACCCACTTAACTTTCCCTAA
should be adapter- and quality trimmed with Trim Galore as usual. In addition, reads need to be trimmed by 15bp from their 3' end to get rid of potential UMI and fixed sequences. The command is:
trim_galore --paired --three_prime_clip_R1 15 --three_prime_clip_R2 15 *.clock_UMI.R1.fq.gz *.clock_UMI.R2.fq.gz
in '--dual_index' mode (see here: https://github.com/FelixKrueger/Umi-Grinder). UmiBam recognises the UMIs within this pattern: R1:(ATCTAGTT):R2:(CAATTTTG): as (UMI R1) and (UMI R2).
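
Going by the read layout drawn above (8 bp of UMI, 5 bp of fixed sequence, then the fragment), the 5' end of a read can be split as in the following sketch. split_clock_read is a hypothetical helper; it only separates the three parts and does not reproduce how Trim Galore writes them into the read ID.

```sh
# Hypothetical helper: split the 5-prime end of a read into UMI, fixed
# sequence and remaining fragment, using the 8 + 5 bp layout shown above.
split_clock_read() {
  umi=$(printf '%s\n' "$1" | cut -c 1-8)
  fixed=$(printf '%s\n' "$1" | cut -c 9-13)
  fragment=$(printf '%s\n' "$1" | cut -c 14-)
  printf '%s %s %s\n' "$umi" "$fixed" "$fragment"
}
```

For the start of the Read 1 example above, split_clock_read ATCTAGTTCAGTACGGTGTTTT prints "ATCTAGTT CAGTA CGGTGTTTT".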

--polyA This is a new, still experimental, trimming mode to identify and remove poly-A tails from sequences.

sequences contain more often a stretch of either 'AAAAAAAAAA' or 'TTTTTTTTTT'. This determines if Read 1 of a paired-end file, or single-end files, are trimmed for PolyA or PolyT. In case of paired-end sequencing, Read 2 is trimmed for the complementary base from the start of the reads. The auto-detection uses a default of A{20} for Read 1 (3'-end trimming) and T{150} for Read 2 (5'-end trimming). These values may be changed manually using the options -a and -a2.
how many bases were trimmed so it can later be used to identify PolyA trimmed sequences. This is currently done by writing tags to both the start ("32:A:") and end ("_PolyA:32") of the reads in the following example:
@READ-ID:1:1102:22039:36996 1:N:0:CCTAATCC
GCCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAATAAAAACTTTATAAACACCAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
@32:A:READ-ID:1:1102:22039:36996_1:N:0:CCTAATCC_PolyA:32
GCCTAAGGAAACAAGTACACTCCACACATGCATAAAGGAAATCAAATGTTATTTTTAAGAAAATGGAAAATAAAAACTTTATAAACACC
before looking for Poly-A tails, and it is the user's responsibility to carry out an initial round of trimming. The following sequence:
1) trim_galore file.fastq.gz
2) trim_galore --polyA file_trimmed.fq.gz
3) zcat file_trimmed_trimmed.fq.gz | grep -A 3 PolyA | grep -v ^-- > PolyA_trimmed.fastq
Finally, if desired, 3) will specifically write PolyA trimmed sequences to a FastQ file of your choice.
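
The read ID tagging shown in the example above can be sketched as follows. tag_polya_header is a hypothetical helper that covers only the PolyA case from that example (count at the start, "_PolyA:<count>" at the end, spaces replaced by underscores); how other cases are tagged is not described here and is not modelled.

```sh
# Hypothetical helper: rewrite a FastQ header the way the PolyA example above
# shows it. Usage: tag_polya_header <header> <number_of_trimmed_bases>
tag_polya_header() {
  header=${1#@}                                      # strip the leading @
  count=$2
  id=$(printf '%s\n' "$header" | tr ' ' '_')         # spaces become underscores
  printf '@%s:A:%s_PolyA:%s\n' "$count" "$id" "$count"
}
```

Called as tag_polya_header '@READ-ID:1:1102:22039:36996 1:N:0:CCTAATCC' 32, it prints the tagged header from the example.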

RRBS-specific options (MspI digested material):

--rrbs Specifies that the input file was an MspI digested RRBS sample (recognition

will have a further 2 bp removed from their 3' end. Sequences which were merely trimmed because of poor quality will not be shortened further. Read 2 of paired-end libraries will in addition have the first 2 bp removed from the 5' end (by setting '--clip_R2 2'). This is to avoid using artificial methylation calls from the filled-in cytosine positions close to the 3' MspI site in sequenced fragments. This option is not recommended for users of the NuGEN Ovation RRBS System 1-16 kit (see below).

--non_directional Selecting this option for non-directional RRBS libraries will screen

and, if found, removes the first two basepairs. As with the option '--rrbs', this avoids using cytosine positions that were filled in during the end-repair step. '--non_directional' requires '--rrbs' to be specified as well. Note that this option does not set '--clip_R2 2' in paired-end mode.

--keep Keep the quality trimmed intermediate file. Default: off, which means

an effect for RRBS samples since other FastQ files are not trimmed for poor qualities separately.

Note for RRBS using the NuGEN Ovation RRBS System 1-16 kit:

Owing to the fact that the NuGEN Ovation kit attaches a varying number of nucleotides (0-3) after each MspI site Trim Galore should be run WITHOUT the option --rrbs. This trimming is accomplished in a subsequent diversity trimming step afterwards (see their manual).

Note for RRBS using MseI:

If your DNA material was digested with MseI (recognition motif: TTAA) instead of MspI it is NOT necessary to specify --rrbs or --non_directional since virtually all reads should start with the sequence 'TAA', and this holds true for both directional and non-directional libraries. As the end-repair of 'TAA' restricted sites does not involve any cytosines, it does not need any special treatment. Instead, simply run Trim Galore in the standard (i.e. non-RRBS) mode.

Paired-end specific options:

--paired This option performs length trimming of quality/adapter/RRBS trimmed reads for

are required to have a certain minimum length which is governed by the option --length (see above). If only one read passes this length threshold the other read can be rescued (see option --retain_unpaired). Using this option lets you discard too short read pairs without disturbing the sequence-by-sequence order of FastQ files which is required by many aligners.
file1_1.fq file1_2.fq SRR2_1.fq.gz SRR2_2.fq.gz ... .

-t/--trim1 Trims 1 bp off every read from its 3' end. This may be needed for FastQ files that

alignments like this:
or this: -----------------------> R1
<----------------- R2
NOTE: If you are planning to use Bowtie2, BWA etc. you don't need to specify this option.

--retain_unpaired If only one of the two paired-end reads became too short, the longer

output files. The length cutoff for unpaired single-end reads is governed by the parameters -r1/--length_1 and -r2/--length_2. Default: OFF.

-r1/--length_1 <INT> Unpaired single-end read length cutoff needed for read 1 to be written to

'.unpaired_1.fq' output file. These reads may be mapped in single-end mode.
Default: 35 bp.

-r2/--length_2 <INT> Unpaired single-end read length cutoff needed for read 2 to be written to

'.unpaired_2.fq' output file. These reads may be mapped in single-end mode.
Default: 35 bp.
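
How --length, --retain_unpaired, --length_1 and --length_2 interact can be summarised in a small decision sketch. classify_pair is a hypothetical helper that assumes --retain_unpaired is switched on and uses the defaults stated above (20 bp for the pair, 35 bp for each unpaired read); it is a reading of the option descriptions, not Trim Galore's own code.

```sh
# Hypothetical helper: decide where a read pair would end up, assuming
# --retain_unpaired is set.
# Usage: classify_pair <len_R1> <len_R2> [length] [length_1] [length_2]
classify_pair() {
  len1=$1
  len2=$2
  min_len=${3:-20}
  r1_cutoff=${4:-35}
  r2_cutoff=${5:-35}

  # Both reads long enough: the pair goes to the validated output files.
  if [ "$len1" -ge "$min_len" ] && [ "$len2" -ge "$min_len" ]; then
    echo "validated"
    return 0
  fi

  # Otherwise each read may be rescued on its own if it meets its cutoff.
  out=""
  if [ "$len1" -ge "$r1_cutoff" ]; then
    out="unpaired_1"
  fi
  if [ "$len2" -ge "$r2_cutoff" ]; then
    out="${out:+$out }unpaired_2"
  fi
  echo "${out:-discarded}"
}
```

For instance, classify_pair 50 10 prints unpaired_1, whereas classify_pair 30 10 prints discarded because the 30 bp read is below the 35 bp unpaired cutoff.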

Last modified on 07 November 2019.

AUTHOR


This manpage was written by Nilesh Patra for the Debian distribution and
can be used for any other usage of the program.

April 2020 trim_galore 0.6.5