NAME
Bio::Assembly::IO::sam - An IO module for assemblies in Sam format *BETA*
SYNOPSIS
use Bio::Assembly::IO;
$aio = Bio::Assembly::IO->new( -file => "mysam.bam",
                               -refdb => "myrefseqs.fas");
$assy = $aio->next_assembly;
DESCRIPTION
This is a (currently) read-only IO module designed to convert Sequence/Alignment
Map (SAM; <http://samtools.sourceforge.net/>) formatted alignments to
Bio::Assembly::Scaffold representations containing Bio::Assembly::Contig and
Bio::Assembly::Singlet objects. It uses lstein's Bio::DB::Sam to parse binary
SAM (.bam) files, guided by a reference sequence FASTA database.
NB: "Bio::DB::Sam" is not a BioPerl module; it can be obtained
via CPAN. It in turn requires the "libbam" library; source can be
downloaded at <http://samtools.sourceforge.net/>.
DETAILS
- Required files
A binary SAM (".bam") alignment and a reference sequence database
in FASTA format are required. Various required indexes (".fai",
".bai") will be created as necessary (via Bio::DB::Sam).
- Compressed files
Gzip-compressed input files can be specified directly if
IO::Uncompress::Gunzip is installed. It is available from CPAN.
- BAM vs. SAM
The input alignment should be in (possibly gzipped) binary SAM
(".bam") format. If it isn't, you will get a message explaining
how to convert it, viz.:
$ samtools view -Sb mysam.sam > mysam.bam
The BAM file must also be sorted by coordinate:
$ samtools sort mysam.unsorted.bam > mysam.bam
- Contigs
Contigs are calculated by this module, using the 'coverage' feature of the
Bio::DB::Sam object. A contig represents a contiguous portion of a
reference sequence having non-zero coverage at each base.
The bwa aligner (<http://bio-bwa.sourceforge.net/>) can assign read
sequences to multiple reference sequence locations. The present
implementation assigns such reads only to the first contig in which
they appear.
- Consensus sequences
Consensus sequence and quality objects are calculated by this module, using
the "pileup" callback feature of "Bio::DB::Sam". The
consensus is (currently) simply the residue at a position that has the
maximum sum of quality values. The consensus quality is the integer
portion of the simple average of quality values for the consensus
residue.
- SeqFeatures
Read sequences stored in contigs are accompanied by the following features:
contig : name of associated contig
cigar : CIGAR string for this read
If the read is paired with a successfully mapped mate, these features will
also be available:
mate_start : coordinate to which the mate was aligned
mate_len : length of mate read
mate_strand : strand of mate (-1 or 1)
insert_size : size of insert spanned by the mate pair
These features are obtained as follows:
@ids = $contig->get_seq_ids;
$an_id = $ids[0]; # or whatever
$seq = $contig->get_seq_by_name($an_id);
# Bio::LocatableSeq's aren't SeqFeature containers, so the features
# are stored on the contig and looked up by tag
$feat = $contig->get_seq_feat_by_tag(
    $seq, "_aligned_coord:".$seq->id
);
($cigar) = $feat->get_tag_values('cigar');
# etc.
TODO
- Supporting both text SAM (TAM) and binary SAM (BAM)
FEEDBACK
Mailing Lists
User feedback is an integral part of the evolution of this and other Bioperl
modules. Send your comments and suggestions preferably to the Bioperl mailing
list. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
Support
Please direct usage questions or support issues to the mailing list:
bioperl-l@bioperl.org
rather than to the module maintainer directly. Many experienced and responsive
experts will be able to look at the problem and quickly address it. Please
include a thorough description of the problem with code and data examples if
at all possible.
Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track of the bugs
and their resolution. Bug reports can be submitted via the web:
https://redmine.open-bio.org/projects/bioperl/
AUTHOR - Mark A. Jensen
Email maj -at- fortinbras -dot- us
APPENDIX
The rest of the documentation details each of the object methods. Internal
methods are usually preceded by an underscore (_).
Bio::Assembly::IO compliance
next_assembly()
Title : next_assembly
Usage : my $scaffold = $asmio->next_assembly();
Function: return the next assembly in the sam-formatted stream
Returns : Bio::Assembly::Scaffold object
Args : none
next_contig()
Title : next_contig
Usage : my $contig = $asmio->next_contig();
Function: return the next contig or singlet from the sam stream
Returns : Bio::Assembly::Contig or Bio::Assembly::Singlet
Args : none
write_assembly()
Title : write_assembly
Usage :
Function: not implemented (module currently read-only)
Returns :
Args :
Internal
_store_contig()
Title : _store_contig
Usage : my $contigobj = $self->_store_contig(\%contiginfo);
Function: create and load a contig object
Returns : Bio::Assembly::Contig object
Args : Bio::DB::Sam::Segment object
_store_read()
Title : _store_read
Usage : my $readobj = $self->_store_read($readobj, $contigobj);
Function: store information of a read belonging to a contig in a contig object
Returns : Bio::LocatableSeq
Args : Bio::DB::Bam::AlignWrapper, Bio::Assembly::Contig
_store_singlet()
Title : _store_singlet
Usage : my $singletobj = $self->_store_singlet($contigobj);
Function: convert a contig object containing a single read into
a singlet object
Returns : Bio::Assembly::Singlet
Args : Bio::Assembly::Contig (previously loaded with only one seq)
REALLY Internal
_init_sam()
Title : _init_sam
Usage : $self->_init_sam($fasfile)
Function: obtain a Bio::DB::Sam parsing of the associated sam file
Returns : true on success
Args : [optional] name of the fasta reference db (scalar string)
Note : The associated file can be plain text (.sam) or binary (.bam);
If the fasta file is not specified, and no file is contained in
the refdb() attribute, a .fas file with the same
basename as the sam file will be searched for.
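The fallback in the note above (a .fas file with the same basename as the alignment file) can be illustrated with a small plain-Perl sketch. It is not the module's code: default_refdb_name() is a hypothetical helper, and it only shows how such a candidate name could be derived, assuming a plain ".sam" or ".bam" extension. It does not check that the file exists.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper (not part of Bio::Assembly::IO::sam): derive the
# name of a reference FASTA file sharing the alignment file's basename.
sub default_refdb_name {
    my ($samfile) = @_;
    # Split off the directory and a trailing .sam or .bam extension.
    my ($base, $dir) = fileparse($samfile, qr/\.[bs]am/);
    return "$dir$base.fas";
}

print default_refdb_name("data/mysam.bam"), "\n";
```

A caller would still need to test the returned path with -e before using it.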
_get_contig_segs_from_coverage()
Title : _get_contig_segs_from_coverage
Usage :
Function: calculates separate contigs using coverage info
in the segment
Returns : array of Bio::DB::Sam::Segment objects, representing
each contig
Args : Bio::DB::Sam::Segment object
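As stated under DETAILS, a contig is a contiguous portion of a reference sequence with non-zero coverage at each base. The following plain-Perl sketch shows that idea on a hard-coded coverage array. It is an illustration under assumptions, not the module's implementation: contig_intervals() is a hypothetical helper, and in the module the coverage values would come from Bio::DB::Sam rather than a literal list.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Assembly::IO::sam): given per-base
# coverage values for one reference sequence, return the 1-based
# [start, end] intervals in which every base has non-zero coverage.
sub contig_intervals {
    my @coverage = @_;
    my @intervals;
    my $start;    # start of the currently open run, if any
    for my $i (0 .. $#coverage) {
        if ($coverage[$i] > 0) {
            $start = $i + 1 unless defined $start;
        }
        elsif (defined $start) {
            push @intervals, [$start, $i];    # run ended at the previous base
            undef $start;
        }
    }
    # Close a run that extends to the end of the sequence.
    push @intervals, [$start, scalar @coverage] if defined $start;
    return @intervals;
}

my @cov = (0, 2, 3, 1, 0, 0, 4, 4, 0, 1);
print join(', ', map { "$_->[0]-$_->[1]" } contig_intervals(@cov)), "\n";
# prints: 2-4, 7-8, 10-10
```

Each interval would correspond to one contig segment; an interval covered by a single read would become a singlet rather than a contig.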
_calc_consensus_quality()
Title : _calc_consensus_quality
Usage : @qual = $aio->_calc_consensus_quality( $contig_seg );
Function: calculate an average or other data-reduced quality
over all sites represented by the features contained
in a Bio::DB::Sam::Segment
Returns :
Args : a Bio::DB::Sam::Segment object
_calc_consensus()
Title : _calc_consensus
Usage : $consensus = $aio->_calc_consensus( $contig_seg );
Function: calculate a simple quality-weighted consensus sequence
for the segment
Returns : a SeqWithQuality object
Args : a Bio::DB::Sam::Segment object
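The consensus rule described under DETAILS (the residue with the maximum sum of quality values, with a consensus quality equal to the integer portion of the average quality for that residue) can be sketched in plain Perl. This is an illustration only, not the module's code: column_consensus() and the [residue, quality] input layout are hypothetical, and the alphabetical tie-break is an assumption, since the documentation does not say how ties are resolved.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bio::Assembly::IO::sam): one pileup
# column is passed as a list of [residue, quality] pairs.
sub column_consensus {
    my @column = @_;
    my (%qual_sum, %count);
    for my $obs (@column) {
        my ($residue, $qual) = @$obs;
        $qual_sum{$residue} += $qual;
        $count{$residue}++;
    }
    # Residue with the largest summed quality; ties are broken
    # alphabetically (an assumption made to keep the result deterministic).
    my ($best) = sort { $qual_sum{$b} <=> $qual_sum{$a} || $a cmp $b }
                 keys %qual_sum;
    return unless defined $best;
    # Consensus quality: integer part of the mean quality of that residue.
    return ($best, int($qual_sum{$best} / $count{$best}));
}

my @columns = (
    [ ['A', 30], ['A', 21], ['C', 45] ],    # A: 51 vs C: 45
    [ ['G', 10], ['T', 35], ['G', 15] ],    # T: 35 vs G: 25
);
my $seq = '';
my @qual;
for my $col (@columns) {
    my ($residue, $q) = column_consensus(@$col);
    $seq .= $residue;
    push @qual, $q;
}
print "$seq @qual\n";    # prints: AT 25 35
```

Note that a single high-quality read can outvote two lower-quality reads under this rule, as in the second column.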
refdb()
Title : refdb
Usage : $obj->refdb($newval)
Function: the name of the reference db fasta file
Example :
Returns : value of refdb (a scalar)
Args : on set, new value (a scalar or undef, optional)
_segset()
Title : _segset
Usage : $segset_hashref = $self->_segset()
Function: hash container for the Bio::DB::Sam::Segment objects that
represent each set of contigs for each seq_id
{ $seq_id => [@contig_segments], ... }
Example :
Returns : value of _segset (a hashref) if no arg,
or the arrayref of contig segments, if arg == a seq id
Args : [optional] seq id (scalar string)
Note : readonly; hash elt set in _init_sam()
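The two return modes described above (the whole hashref without an argument, the arrayref of contig segments for a given seq id) can be sketched with a stand-in class. Everything here is hypothetical: SegsetDemo is not part of BioPerl, and plain strings stand in for Bio::DB::Sam::Segment objects.

```perl
use strict;
use warnings;

package SegsetDemo;    # hypothetical stand-in for the IO object

sub new {
    my ($class, %segs) = @_;
    return bless { _segset => {%segs} }, $class;
}

# Read-only accessor: the whole hashref without an argument,
# the arrayref of contig segments for one seq id otherwise.
sub _segset {
    my ($self, $seq_id) = @_;
    return $self->{_segset} unless defined $seq_id;
    return $self->{_segset}{$seq_id};
}

package main;

# Strings stand in for Bio::DB::Sam::Segment objects.
my $io = SegsetDemo->new(
    chr1 => [ 'chr1:2-4', 'chr1:7-8' ],
    chr2 => [ 'chr2:1-90' ],
);
print join(' ', sort keys %{ $io->_segset }), "\n";    # prints: chr1 chr2
print scalar @{ $io->_segset('chr1') }, "\n";          # prints: 2
```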
_current_refseq_id()
Title : _current_refseq_id
Usage : $obj->_current_refseq_id($newval)
Function: the "current" reference sequence id
Example :
Returns : value of _current_refseq_id (a scalar)
Args : on set, new value (a scalar or undef, optional)
_current_contig_seg_idx()
Title : _current_contig_seg_idx
Usage : $obj->_current_contig_seg_idx($newval)
Function: the "current" segment index in the "current" refseq
Example :
Returns : value of _current_contig_seg_idx (a scalar)
Args : on set, new value (a scalar or undef, optional)
sam()
Title : sam
Usage : $obj->sam($newval)
Function: store the associated Bio::DB::Sam object
Example :
Returns : value of sam (a Bio::DB::Sam object)
Args : on set, new value (a scalar or undef, optional)