
scramble(1) Staden io_lib scramble(1)

NAME

scramble - Converts between the SAM, BAM and CRAM file formats.

SYNOPSIS

scramble [options] [input_file [output_file]]

DESCRIPTION

scramble converts between various next-gen sequencing alignment file formats, including SAM, BAM and CRAM. It can either act as a pipe reading stdin and writing to stdout, or on named files.

When operating as a pipe the input type defaults to SAM or BAM, so the -I cram option is required to indicate that the input is in CRAM format. The output defaults to BAM, but can be adjusted by using the -O format option. When given filenames the file type is automatically chosen based on the filename suffix.
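The suffix rule and the pipe defaults can be illustrated with a short shell sketch. This is only an illustration of the behaviour described above, not scramble's own detection code; the helper names format_from_suffix and pipe_flags and the file names are hypothetical.

```shell
# Hypothetical helper: map a filename suffix to a scramble format name.
format_from_suffix() {
    case "$1" in
        *.sam)  echo sam ;;
        *.bam)  echo bam ;;
        *.cram) echo cram ;;
        *)      echo unknown; return 1 ;;
    esac
}

# Hypothetical helper: print the explicit -I/-O flags that a pipe would
# need in order to match what the filename suffixes imply.
pipe_flags() {
    in_fmt=$(format_from_suffix "$1") || return 1
    out_fmt=$(format_from_suffix "$2") || return 1
    echo "-I $in_fmt -O $out_fmt"
}

pipe_flags in.cram out.sam
```

Under these assumptions, "scramble in.cram out.sam" and "scramble -I cram -O sam < in.cram > out.sam" request the same conversion; only the pipe form has to state the formats explicitly.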

OPTIONS

-I format
Selects the input format, where format is one of sam, bam or cram. Use this when reading via a pipe to avoid input bytes being consumed when attempting to detect if the input is in SAM or BAM format.

-O format
Selects the output format, where format is one of sam, bam or cram.

-1 to -9
Sets the compression level from 1 (low compression, fast) to 9 (high compression, slow) when writing in BAM or CRAM format. This is only used during writing.

-0 or -u
Writes uncompressed data. In BAM this still uses BGZF containers, but with no internal compression. In CRAM it stores blocks in RAW format instead. The option has no effect on SAM output.

CRAM encoding only. Add bzip2 to the list of compression codecs potentially used during CRAM creation.

-Z
CRAM encoding only. Add lzma to the list of compression codecs potentially used during CRAM creation. Because lzma compresses slowly, it is normally chosen only where it gives a significant advantage over zlib or bzip2. At higher compression levels (-7 onwards) this weighting is ignored, as lzma decompression speed is acceptable, albeit still slower than zlib.

CRAM decoding only. Generate MD:Z: and NM:I: auxiliary fields based on the reference-based compression.

CRAM encoding only. Forcibly pack sequences from multiple references into the same slice. Normally CRAM will start a new slice when changing from one reference to another, but will still automatically switch to multi-reference slices if the number of sequences per slice becomes too small.

Currently for CRAM input only, but SAM/BAM support is pending. This indicates a reference sequence name and optionally a start and end location within that reference, using the syntax ref_name or ref_name:start-end. For efficient operation the CRAM file needs a .crai format index (built using the cram_index program).

-r ref.fa
CRAM encoding only. Use this to specify the reference fasta file. Note that if the input SAM or BAM file has a file: or local file system based URI specified in the @SQ headers then this option may not be necessary.

CRAM encoding only. Specifies the number of sequences per slice. Defaults to 10000.

CRAM encoding only. Specifies the number of slices per container. Defaults to 1.

BAM and CRAM only. Specifies the number of compression or decompression threads, adaptively shared between both encoding and decoding. Defaults to 1 (no threading).

-V version
CRAM encoding only. Sets the CRAM file format version. Supported values are "2.0", "2.1" and "3.0".

-e
CRAM encoding only. Embed snippets of the reference sequence in every slice. This means the files can be decoded without needing to specify the reference fasta file.

CRAM encoding only. Embed snippets of the consensus sequence in every slice. This operates as per the -e option, but the consensus is generated from the aligned data. This does not therefore require a reference to be known during encode (although it is still a mandatory part of the specification that the SQ SAM headers have an M5 field). It also means the files can be decoded without needing to specify the reference fasta file.

CRAM encoding only. Omit reference based compression and instead store details of every base verbatim.

Experimental, encoding only. When storing quality values, bin into 8 discrete values (plus 0), as typically used by modern Illumina instruments. (Note that the bins may not be precisely the same ranges.)

-!
CRAM v3.0 and above decoding only. Do not check CRCs. This option should only be used when attempting to recover from a data corruption.

Do not append @PG header lines with the scramble program name and arguments.

-X mode
Encode CRAM using a set of predefined parameters defined by mode. This is one of fast, normal (or default), small or archive.

fast
Lightweight compression for speed and small slice size for quick fine-grained random access.

normal
Default mode. This is the same as not specifying -X. For version 3.1 onwards this enables the name tokeniser ("-T").

small
Optimise for smaller files, with larger slices.

archive
Optimised for smallest files, intended for data archival. This uses a large slice size and will have poorer random access. At level 7 onwards this also enables lzma compression if compiled in ("-Z").
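To compare the profiles on the same input, a loop along the following lines may help. It is a minimal sketch: the helper name profile_cmd and the file names in.bam and out.<mode>.cram are hypothetical, and it assumes the reference can be located without -r. If scramble or in.bam is not available it only prints the commands it would run.

```shell
# Hypothetical helper: print the scramble command line for one -X profile.
# in.bam and the out.<mode>.cram names are placeholders.
profile_cmd() {
    echo "scramble -X $1 in.bam out.$1.cram"
}

for mode in fast normal small archive; do
    cmd=$(profile_cmd "$mode")
    if command -v scramble >/dev/null 2>&1 && [ -f in.bam ]; then
        # Run the conversion, then report the resulting file size in bytes.
        $cmd && wc -c < "out.$mode.cram"
    else
        echo "would run: $cmd"
    fi
done
```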

-d tag-list
Discard all auxiliary tags except those listed in tag-list. The list is comma separated and contains the two letter tag codes specified as-is or with simplified regular expressions. Character classes such as "[A-W]" are permitted, but not with the negation code "^". Also "." is a synonym for any legal tag character. Hence "[A-Z][A-Z0-9]" represents all tag types belonging to the official namespace.

The option may be specified more than once, but it cannot be mixed with -D.

-D tag-list
Discard auxiliary tags listed in tag-list, keeping everything else. The list is comma separated and contains only the two letter tag codes. As with -d, tag-list can be specified using a simplified regular expression. This means -D .. removes all auxiliary tags.

The option may be specified more than once, but it cannot be mixed with -d.
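The tag-list matching described for -d and -D can be illustrated with a small shell sketch. It is not scramble's own matcher: the helper name matches_list is hypothetical, and it approximates the simplified regular expressions with shell case patterns (each "." becomes "?"). It assumes two-letter tags and a C locale for character ranges.

```shell
# Hypothetical helper: succeed if TAG matches any pattern in a
# comma-separated tag-list. Runs in a subshell so the locale, IFS and
# globbing changes do not leak into the caller.
matches_list() (
    LC_ALL=C      # make [A-Z] and [a-z] plain ASCII ranges
    IFS=,         # split the tag-list on commas
    set -f        # do not expand patterns against file names
    for pat in $1; do
        glob=$(printf '%s' "$pat" | tr '.' '?')   # "." means any character
        case "$2" in
            $glob) exit 0 ;;
        esac
    done
    exit 1
)

# Which of these tags would a -D '[a-zXYZ].,.[a-z]' style list discard?
for tag in NM MD XS zq; do
    if matches_list '[a-zXYZ].,.[a-z]' "$tag"; then
        echo "$tag: discarded"
    else
        echo "$tag: kept"
    fi
done
```

With -d the sense is inverted: tags that match the list are the ones kept. Quoting the patterns, as done here, stops the shell from expanding them as file name globs.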

EXAMPLES

To convert a BAM file from stdin to CRAM on stdout, using reference MT.fa:


some_command | scramble -I bam -O cram -r MT.fa | some_command

The default CRAM output format is version 3.0. The command below enables the experimental newer compression codecs (NB: do not use this in production) using the "small" profile, while also removing all tag types reserved for local/private use. (Also consider -d [A-Z][A-Z0-9] instead of the -D arguments.)


scramble -V 3.1 -X small -D '[a-zXYZ].' -D '.[a-z]' in.cram out.cram

AUTHOR

James Bonfield, Wellcome Trust Sanger Institute

December 6 2022