
BAMTOFASTQ(1) General Commands Manual BAMTOFASTQ(1)

NAME

bamtofastq - convert SAM, BAM or CRAM files to FastQ

SYNOPSIS

bamtofastq [options]

DESCRIPTION

bamtofastq reads a SAM, BAM or CRAM file from standard input and converts it to the FastQ format. The output can be split into multiple files according to the pair flags of the reads involved. bamtofastq can collate the source reads according to their read names, i.e. place pairs of reads next to each other in the output. bamtofastq writes its output to the standard output channel by default. All output channels can be compressed using gzip.

The following key=value pairs can be given:

F=<stdout>: output file for the first mates of pairs if collation is active.

F2=<stdout>: output file for the second mates of pairs if collation is active.

S=<stdout>: output file for single end reads if collation is active.

O=<stdout>: output file for unmatched (orphan) first mates if collation is active.

O2=<stdout>: output file for unmatched (orphan) second mates if collation is active.

collate=<0|1>: Valid values are

1:
collate read pairs
0:
output reads to standard output in the order in which they appear in the BAM file
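
As a minimal sketch of collated output, the shell function below wraps a call that sets collate=1 and routes each of the five output classes to its own file. The function name collate_to_files and the file name pattern are hypothetical and chosen only for illustration; the keys themselves are the ones listed above.

```shell
# Sketch: collate read pairs and write each output class to its own file.
# collate_to_files and the <prefix>_*.fq names are illustrative only,
# they are not part of bamtofastq.
collate_to_files() {
    input=$1    # input BAM file
    prefix=$2   # prefix for the generated output file names
    bamtofastq collate=1 filename="$input" \
        F="${prefix}_1.fq" F2="${prefix}_2.fq" \
        S="${prefix}_s.fq" \
        O="${prefix}_o1.fq" O2="${prefix}_o2.fq"
}

# Usage: collate_to_files sample.bam sample
```

Any of the five keys that is left unset keeps its default of standard output.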

combs=<0|1>: print some counts after finishing collation-based output

filename=<stdin>: input file name (data is read from standard input if this option is not given)

inputformat=<bam>: input file format. All versions of bamtofastq support the BAM input format. If the program is additionally linked to the io_lib package, then the following formats are valid:

BAM (see http://samtools.sourceforge.net/SAM1.pdf)
SAM (see http://samtools.sourceforge.net/SAM1.pdf)
CRAM (see http://www.ebi.ac.uk/ena/about/cram_toolkit)

reference=: file name of the reference for CRAM input files. If this key is unset, the CRAM file header is scanned to obtain a reference file name.

exclude=<SECONDARY>: Do not include reads in the output that have any of the given flags set. The flags are given separated by commas. Valid flags are:

read was paired in sequencing
read has been mapped as part of a proper pair
read was not mapped
mate of read was not mapped
read was mapped to the reverse strand
mate of read was mapped to the reverse strand
read was first read of a pair during sequencing
read was second read of a pair during sequencing
alignment is secondary, i.e. an alternative mapping to the primary alignment in the same file
read is marked as having failed quality control
read is marked as a duplicate of another read in the same file (see bammarkduplicates)
read is marked as supplementary alignment

disablevalidation=<0>: Valid values are

0:
run input file validation on alignments (this is the default)
1:
do not check the validity of the input file (this may help for some broken input files, but it is a security risk as it can lead to the execution of arbitrary code through a forged input file).

colhlog=<18>: base two logarithm of the size of the hash table used for collation. The default value of 18 should work reasonably well for most input files; see the biobambam paper at arxiv.org/abs/1306.0836 for details.
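
Because colhlog is a base two logarithm, each increment doubles the hash table size. The short sketch below only checks that arithmetic in the shell; the variable names are illustrative and nothing here calls bamtofastq.

```shell
# colhlog is the base two logarithm of the hash table size,
# so the table size is 2^colhlog. Variable names are illustrative.
colhlog=18
table_size=$((1 << colhlog))
echo "colhlog=$colhlog gives a hash table size of 2^$colhlog = $table_size"
```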

colsbs=<128M>: size of the hash table overflow list in bytes. The default of 128MB should work reasonably well for most input files; see the biobambam paper at arxiv.org/abs/1306.0836 for details.

T=<bamtofastq_hostname_pid_time>: file name of the temporary file used for collation

ranges=<>: coordinate ranges selected from the input. This option is only available for input files in BAM and CRAM format that have a corresponding index file (.bai for BAM, .crai for CRAM), and only when the input is read from a file (i.e. the filename argument is set). Valid ranges consist of either

a whole reference sequence (e.g. "chr1")
an interval on a reference sequence half open on the right (e.g. "chr1:50000" which means alignments overlapping chr1 from position 50000 to the end of chr1)
an interval on a reference sequence (e.g. "chr1:50000-60000" which means alignments overlapping positions 50000 to 60000 on chr1)

For BAM input multiple ranges are separated by space characters (e.g. ranges="chr1:10000-20000 chr1:30000-40000"). CRAM input supports a single range only.
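
The range syntax above can be sketched as a small wrapper that joins several ranges with spaces, as described for BAM input. The helper name fastq_for_ranges and the file names in the usage line are hypothetical; the input is passed via filename= because ranges require file input with a corresponding index.

```shell
# Sketch: extract reads overlapping selected ranges from an indexed BAM file.
# fastq_for_ranges is a hypothetical helper, not part of bamtofastq.
fastq_for_ranges() {
    input=$1
    shift
    # "$*" joins the remaining arguments with single spaces (default IFS),
    # which is the separator used for multiple ranges on BAM input.
    bamtofastq filename="$input" ranges="$*"
}

# Usage: fastq_for_ranges sample.bam chr1:10000-20000 chr1:30000-40000 > selected.fq
```

For CRAM input the helper should be called with exactly one range.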

gz=<[0|1]>: compress output files using gzip. By default output is uncompressed.

level=<-1|0|1|9|11>: set compression level of the output FastQ/FastA files if gz=1. Valid values are

-1:
zlib/gzip default compression level
0:
uncompressed
1:
zlib/gzip level 1 (fast) compression
9:
zlib/gzip level 9 (best) compression

If libmaus has been compiled with support for igzip (see https://software.intel.com/en-us/articles/igzip-a-high-performance-deflate-compressor-with-optimizations-for-genomic-data) then an additional valid value is

11:
igzip compression
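
A possible way to combine gz and level is sketched below: collated output is written to gzip compressed files with a caller-selected level. The helper name gz_fastq and the output file names are assumptions made for this example; level falls back to -1, the zlib/gzip default listed above.

```shell
# Sketch: write gzip compressed, collated output with a chosen level.
# gz_fastq and the <prefix>_*.fq.gz names are illustrative only.
gz_fastq() {
    input=$1
    prefix=$2
    level=${3:--1}   # -1 selects the zlib/gzip default compression level
    bamtofastq collate=1 gz=1 level="$level" filename="$input" \
        F="${prefix}_1.fq.gz" F2="${prefix}_2.fq.gz" \
        S="${prefix}_s.fq.gz" \
        O="${prefix}_o1.fq.gz" O2="${prefix}_o2.fq.gz"
}

# Usage: gz_fastq sample.bam sample 9
```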

fasta=<0|1>: output FastA instead of FastQ if fasta=1.

outputperreadgroup=<0|1>: split output by read group if outputperreadgroup=1 (default is 0). If splitting by read group is performed, no output is written to standard output; all data is written to files. The file names are generated from the outputdir and outputperreadgroupsuffix parameters and the read group names.

outputdir=<>: output directory if outputperreadgroup=1. By default the output files are generated in the current directory.

outputperreadgrouprgsm=<0|1>: include the SM field of the read group in output file names if outputperreadgroup=1 (default is 0).

outputperreadgroupprefix=: add the given prefix ahead of file names if outputperreadgroup=1 (default is to add no prefix).

outputperreadgroupsuffixF=<_1.fq>: output file name suffix for first mates of complete pairs if outputperreadgroup=1. Default is _1.fq if gz=0 and _1.fq.gz for gz=1.

outputperreadgroupsuffixF2=<_2.fq>: output file name suffix for second mates of complete pairs if outputperreadgroup=1. Default is _2.fq if gz=0 and _2.fq.gz for gz=1.

outputperreadgroupsuffixO=<_o1.fq>: output file name suffix for first mates of incomplete pairs if outputperreadgroup=1. Default is _o1.fq if gz=0 and _o1.fq.gz for gz=1.

outputperreadgroupsuffixO2=<_o2.fq>: output file name suffix for second mates of incomplete pairs if outputperreadgroup=1. Default is _o2.fq if gz=0 and _o2.fq.gz for gz=1.

outputperreadgroupsuffixS=<_s.fq>: output file name suffix for single end reads if outputperreadgroup=1. Default is _s.fq if gz=0 and _s.fq.gz for gz=1.
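
The per read group options can be combined roughly as in the following sketch, which writes one set of files per read group into a chosen directory. The helper name split_by_read_group is hypothetical, and creating the directory beforehand is a precaution assumed here rather than a documented requirement. collate=1 is set explicitly because the suffixes refer to complete and incomplete pairs.

```shell
# Sketch: split collated output by read group into a directory.
# split_by_read_group is a hypothetical helper, not part of bamtofastq.
split_by_read_group() {
    input=$1
    outdir=$2
    # Assumption: create the directory up front in case it does not exist.
    mkdir -p "$outdir"
    bamtofastq collate=1 outputperreadgroup=1 outputdir="$outdir" filename="$input"
}

# Usage: split_by_read_group sample.bam per_rg
```

With the default suffixes and gz=0, the generated file names end in _1.fq, _2.fq, _o1.fq, _o2.fq and _s.fq.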

tryoq=<0|1>: use content of OQ aux field if present instead of quality field when converting to FastQ. By default the quality field is used. This option is currently mutually exclusive with the tags option.

tags=<>: provide a comma separated list of aux fields which will be copied from the input alignment records to the comment section of the output FastQ records. By default no aux fields are copied. This option is currently mutually exclusive with the tryoq option.

split=<0>: split named output files into chunks of this number of reads. The output file names will be extended by _NNNNNN if gz=0 and by _NNNNNN.gz if gz=1 where NNNNNN denotes the NNNNNN+1'th output file (i.e. numbers start with 000000). The suffixes k, m, g, K, M and G can be used to denote that the argument is to be multiplied by 1024, 1024^2, 1024^3, 1000, 1000^2 or 1000^3 respectively.
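
The suffix arithmetic described for split can be illustrated with a small shell function. This is only a sketch of the multiplication rule stated above, written for this page; expand_count is a hypothetical helper and not how bamtofastq itself parses the argument.

```shell
# Sketch of the suffix rule: k, m, g multiply by powers of 1024,
# K, M, G multiply by powers of 1000. expand_count is hypothetical.
expand_count() {
    num=${1%[kmgKMG]}   # strip one trailing suffix character, if present
    case $1 in
        *k) echo $((num * 1024)) ;;
        *m) echo $((num * 1024 * 1024)) ;;
        *g) echo $((num * 1024 * 1024 * 1024)) ;;
        *K) echo $((num * 1000)) ;;
        *M) echo $((num * 1000 * 1000)) ;;
        *G) echo $((num * 1000 * 1000 * 1000)) ;;
        *)  echo "$1" ;;
    esac
}

expand_count 2k      # prints 2048
expand_count 2K      # prints 2000
printf '_%06d\n' 0   # prints _000000, the extension of the first chunk if gz=0
```

Under this rule split=2k yields chunks of 2048 reads, while split=2K yields chunks of 2000 reads.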

cols=<>: If set to an unsigned number then wrap the sequence and quality lines at this number of columns. By default no wrapping is performed.

splitprefix=<bamtofastq_split>: file prefix if split>0 and collate=0.

casava18=<0>: produce read names as expected by the c18pe input option of fastqtobam using the ne aux fields produced by fastqtobam.

maxoutput=<>: produce no more than this number of output records. By default there is no limit. This option is only active for collate=0.
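
One use of maxoutput is a quick preview of a large file. The sketch below disables collation, since maxoutput is only active for collate=0, and caps the number of output records. The helper name preview_fastq and its default of 1000 records are assumptions made for this example.

```shell
# Sketch: emit at most a given number of records without collation.
# preview_fastq and its default count are illustrative only.
preview_fastq() {
    input=$1
    count=${2:-1000}   # hypothetical default for this helper
    bamtofastq collate=0 maxoutput="$count" filename="$input"
}

# Usage: preview_fastq sample.bam 100 > preview.fq
```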

AUTHOR

Written by German Tischler.

REPORTING BUGS

Report bugs to <germant@miltenyibiotec.de>

COPYRIGHT

Copyright © 2009-2014 German Tischler, © 2011-2014 Genome Research Limited. License GPLv3+: GNU GPL version 3 <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law.

March 2014 BIOBAMBAM