.\" Automatically generated by Pod::Man 4.14 (Pod::Simple 3.40) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is >0, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{\ . if \nF \{\ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{\ . nr % 0 . nr F 2 . \} . 
\} .\} .rr rF .\" ======================================================================== .\" .IX Title "Bio::SeqIO::msout 3pm" .TH Bio::SeqIO::msout 3pm "2021-08-15" "perl v5.32.1" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" Bio::SeqIO::msout \- input stream for output by Hudson's ms .SH "SYNOPSIS" .IX Header "SYNOPSIS" Do not use this module directly. Use it via the Bio::SeqIO class. .SH "DESCRIPTION" .IX Header "DESCRIPTION" ms ( Hudson, R. R. (2002) Generating samples under a Wright-Fisher neutral model. Bioinformatics 18:337\-8 ) can be found at http://home.uchicago.edu/~rhudson1/source/mksamples.html. .PP Currently, this object can be used to read output from ms into seq objects. However, because bioperl has no support for haplotypes created using an infinite sites model (where '1' identifies a derived allele and '0' identifies an ancestral allele), the sequences returned by msout are coded using A, T, C and G. To decode the bases, use the sequence conversion table (a hash) returned by \&\fBget_base_conversion_table()\fR. In the table, 4 and 5 are used when the ancestry is unclear. This should not ever happen when creating files with ms, but it will be used when creating msOUT files from a collection of seq objects ( To be added later ). Alternatively, use \fBget_next_hap()\fR to get a string with 1's and 0's instead of a seq object. .SS "Mapping to Finite Sites" .IX Subsection "Mapping to Finite Sites" This object can now also be used to map haplotypes created using an infinite sites model to sequences of arbitrary finite length. See \fBset_n_sites()\fR for more detail. Thanks to Filipe G. Vieira for the idea and code. .SH "FEEDBACK" .IX Header "FEEDBACK" .SS "Mailing Lists" .IX Subsection "Mailing Lists" User feedback is an integral part of the evolution of this and other Bioperl modules. 
Send your comments and suggestions preferably to the Bioperl mailing list. Your participation is much appreciated. .PP .Vb 2 \& bioperl\-l@bioperl.org \- General discussion \& http://bioperl.org/wiki/Mailing_lists \- About the mailing lists .Ve .SS "Reporting Bugs" .IX Subsection "Reporting Bugs" Report bugs to the Bioperl bug tracking system to help us keep track of the bugs and their resolution. Bug reports can be submitted via the web: .PP .Vb 1 \& https://github.com/bioperl/bioperl\-live/issues .Ve .SH "AUTHOR \- Warren Kretzschmar" .IX Header "AUTHOR - Warren Kretzschmar" This module was written by Warren Kretzschmar .PP email: wkretzsch@gmail.com .PP This module grew out of a parser written by Aida Andres. .SH "COPYRIGHT" .IX Header "COPYRIGHT" .SS "Public Domain Notice" .IX Subsection "Public Domain Notice" This software/database is ``United States Government Work'' under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This software/database is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use. .PP Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the National Human Genome Research Institute (\s-1NHGRI\s0) and the U.S. Government does not and cannot warrant the performance or results that may be obtained by using this software or data. \s-1NHGRI\s0 and the U.S. Government disclaims all warranties as to performance, merchantability or fitness for any particular purpose. .SH "METHODS" .IX Header "METHODS" .SS "Methods for Internal Use" .IX Subsection "Methods for Internal Use" \fI_initialize\fR .IX Subsection "_initialize" .PP Title : _initialize Usage : \f(CW$stream\fR = Bio::SeqIO::msOUT\->new($infile) Function: extracts basic information about the file. 
Returns : Bio::SeqIO object Args : no_og, gunzip, gzip, n_sites Details : \- include 'no_og' flag if the last population of an msout file contains only one haplotype and you don't want the last haplotype to be treated as the outgroup ( suggested when reading data created by ms ). \- including 'n_sites' (positive integer) causes all output haplotypes to be mapped to a sequence of length 'n_sites'. See \fBset_n_sites()\fR for more details. .PP \fI_read_start\fR .IX Subsection "_read_start" .PP Title : _read_start Usage : \f(CW$stream\fR\->\fB_read_start()\fR Function: reads from the filehandle \f(CW$stream\fR\->{_filehandle} all information up to the first haplotype (sequence). Closes the filehandle if all lines have been read. Returns : void Args : none .SS "Methods to Access Data" .IX Subsection "Methods to Access Data" \fIget_segsites\fR .IX Subsection "get_segsites" .PP Title : get_segsites Usage : \f(CW$segsites\fR = \f(CW$stream\fR\->\fBget_segsites()\fR Function: returns the number of segsites in the msOUT file (according to the msOUT header line's \-s option), or the current run's segsites if \-s was not specified in the command line (in this case the number of segsites varies from run to run). Returns : scalar Args : \s-1NONE\s0 .PP \fIget_current_run_segsites\fR .IX Subsection "get_current_run_segsites" .PP Title : get_current_run_segsites Usage : \f(CW$segsites\fR = \f(CW$stream\fR\->\fBget_current_run_segsites()\fR Function: returns the number of segsites in the run of the last read haplotype (sequence). Returns : scalar Args : \s-1NONE\s0 .PP \fIget_n_sites\fR .IX Subsection "get_n_sites" .PP Title : get_n_sites Usage : \f(CW$n_sites\fR = \f(CW$stream\fR\->\fBget_n_sites()\fR Function: Gets the number of total sites (variable or not) to be output. 
Returns : scalar if n_sites option is defined at call time of \fBnew()\fR Args : \s-1NONE\s0 Note : \s-1WARNING:\s0 Final sequence length might not be equal to n_sites if n_sites is too close to number of segregating sites in the msout file. .PP \fIset_n_sites\fR .IX Subsection "set_n_sites" .PP Title : set_n_sites Usage : \f(CW$n_sites\fR = \f(CW$stream\fR\->set_n_sites($value) Function: Sets the number of total sites (variable or not) to be output. Returns : 1 on success; throws an error if \f(CW$value\fR is not a positive integer or undef Args : positive integer Note : \s-1WARNING:\s0 Final sequence length might not be equal to n_sites if it is too close to number of segregating sites. \- n_sites needs to be at least as large as the number of segsites of the next haplotype returned \- n_sites may also be set to undef, in which case haplotypes are returned under the infinite sites model assumptions. .PP \fIget_runs\fR .IX Subsection "get_runs" .PP Title : get_runs Usage : \f(CW$runs\fR = \f(CW$stream\fR\->\fBget_runs()\fR Function: returns the number of runs in the msOUT file (according to the msinfo line) Returns : scalar Args : \s-1NONE\s0 .PP \fIget_Seeds\fR .IX Subsection "get_Seeds" .PP Title : get_Seeds Usage : \f(CW@seeds\fR = \f(CW$stream\fR\->\fBget_Seeds()\fR Function: returns an array of the seeds used in the creation of the msOUT file. Returns : array Args : \s-1NONE\s0 Details : In older versions, ms used three seeds. Newer versions of ms seem to use only one (longer) seed. This function will return all the seeds found. .PP \fIget_Positions\fR .IX Subsection "get_Positions" .PP Title : get_Positions Usage : \f(CW@positions\fR = \f(CW$stream\fR\->\fBget_Positions()\fR Function: returns an array of the names of each segsite of the run of the last read hap. Returns : array Args : \s-1NONE\s0 Details : The Positions may or may not vary from run to run depending on the options used with ms. 
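.PP
As an illustration of where the header-derived values come from, the sketch below pulls the run count and the \-s value out of an ms command line, which ms echoes as the first line of its output. It is not the module's own parser: parse_ms_info_line and the hash keys it returns are hypothetical names chosen for this example, and the sample command line is made up.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::SeqIO::msout. It extracts the values
# that get_runs() and get_segsites() are described as reporting from an ms
# command line of the form "ms nsam nreps [options]".
sub parse_ms_info_line {
    my ($line) = @_;
    my @fields = split ' ', $line;
    shift @fields;    # drop the program name (e.g. "ms" or "./ms")

    my %info = (
        haps_per_run => $fields[0],    # nsam: haplotypes in each run
        runs         => $fields[1],    # nreps: number of runs
        segsites     => undef,         # stays undef when -s is absent
    );

    # Scan the remaining options for "-s <segsites>".
    for my $i ( 2 .. $#fields ) {
        if ( $fields[$i] eq '-s' && defined $fields[ $i + 1 ] ) {
            $info{segsites} = $fields[ $i + 1 ];
        }
    }
    return \%info;
}

# Made-up example header line.
my $info = parse_ms_info_line('ms 6 2 -s 7 -I 2 3 3');
printf "runs=%d segsites=%d\n", $info->{runs}, $info->{segsites};
```

When \-s is absent the sketch leaves segsites undefined, mirroring the case described for get_segsites where the number of segsites varies from run to run.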
.PP \fIget_tot_run_haps\fR .IX Subsection "get_tot_run_haps" .PP Title : get_tot_run_haps Usage : \f(CW$number_of_haps_per_run\fR = \f(CW$stream\fR\->\fBget_tot_run_haps()\fR Function: returns the number of haplotypes (sequences) in each run of the msOUT file (according to the msinfo line). Returns : scalar >= 0 Args : \s-1NONE\s0 Details : This number should not vary from run to run. .PP \fIget_ms_info_line\fR .IX Subsection "get_ms_info_line" .PP Title : get_ms_info_line Usage : \f(CW$ms_info_line\fR = \f(CW$stream\fR\->\fBget_ms_info_line()\fR Function: returns the header line of the msOUT file. Returns : scalar Args : \s-1NONE\s0 .PP \fItot_haps\fR .IX Subsection "tot_haps" .PP Title : tot_haps Usage : \f(CW$number_of_haplotypes_in_file\fR = \f(CW$stream\fR\->\fBtot_haps()\fR Function: returns the number of haplotypes (sequences) in the msOUT file. Information gathered from msOUT header line. Returns : scalar Args : \s-1NONE\s0 .PP \fIget_Pops\fR .IX Subsection "get_Pops" .PP Title : get_Pops Usage : \f(CW@pops\fR = \f(CW$stream\fR\->\fBget_Pops()\fR Function: returns an array of population sizes (order taken from the \-I flag in the msOUT header line). This array will include the last hap even if it looks like an outgroup. Returns : array of scalars > 0 Args : \s-1NONE\s0 .PP \fIget_next_run_num\fR .IX Subsection "get_next_run_num" .PP Title : get_next_run_num Usage : \f(CW$next_run_number\fR = \f(CW$stream\fR\->\fBget_next_run_num()\fR Function: returns the number of the ms run that the next haplotype (sequence) will be taken from (starting at 1). Returns undef if the complete file has been read. Returns : scalar > 0 or undef Args : \s-1NONE\s0 .PP \fIget_last_haps_run_num\fR .IX Subsection "get_last_haps_run_num" .PP Title : get_last_haps_run_num Usage : \f(CW$last_haps_run_number\fR = \f(CW$stream\fR\->\fBget_last_haps_run_num()\fR Function: returns the number of the ms run that the last haplotype (sequence) was taken from (starting at 1). 
Returns undef if no hap has been read yet. Returns : scalar > 0 or undef Args : \s-1NONE\s0 .PP \fIget_last_read_hap_num\fR .IX Subsection "get_last_read_hap_num" .PP Title : get_last_read_hap_num Usage : \f(CW$last_read_hap_num\fR = \f(CW$stream\fR\->\fBget_last_read_hap_num()\fR Function: returns the number (starting with 1) of the last haplotype read from the ms file. Returns : scalar >= 0 Args : \s-1NONE\s0 Details : 0 means that no haplotype has been read yet. It is reset to 0 every run. .PP \fIoutgroup\fR .IX Subsection "outgroup" .PP Title : outgroup Usage : \f(CW$outgroup\fR = \f(CW$stream\fR\->\fBoutgroup()\fR Function: returns '1' if the msOUT stream has an outgroup. Returns '0' otherwise. Returns : '1' or '0' Args : \s-1NONE\s0 Details : This method will return '1' only if the last population in the msOUT file contains only one haplotype. If the last population is not an outgroup, create the msOUT object with the 'no_og' input flag. The method also returns '0' if the run has only one population. .PP \fIget_next_haps_pop_num\fR .IX Subsection "get_next_haps_pop_num" .PP Title : get_next_haps_pop_num Usage : ($next_haps_pop_num, \f(CW$num_haps_left_in_pop\fR) = \f(CW$stream\fR\->\fBget_next_haps_pop_num()\fR Function: First return value is the population number (starting with 1) the next hap will come from. The second return value is the number of haps left to read in the population from which the next hap will come. Returns : (scalar > 0, scalar > 0) Args : \s-1NONE\s0 .PP \fIget_next_seq\fR .IX Subsection "get_next_seq" .PP Title : get_next_seq Usage : \f(CW$seq\fR = \f(CW$stream\fR\->\fBget_next_seq()\fR Function: reads and returns the next sequence (haplotype) in the stream Returns : Bio::Seq object or void if end of file Args : \s-1NONE\s0 Note : This function is included only to conform to convention. The returned Bio::Seq object holds a haplotype in coded form. 
Use the hash returned by \fBget_base_conversion_table()\fR to convert 'A', 'T', 'C', 'G' back into 0, 1, 4 and 5. Use \fBget_next_hap()\fR to retrieve the haplotype as a string of 0s, 1s, 4s and 5s instead. .PP \fInext_seq\fR .IX Subsection "next_seq" .PP Title : next_seq Usage : \f(CW$seq\fR = \f(CW$stream\fR\->\fBnext_seq()\fR Function: Alias to \fBget_next_seq()\fR Returns : Bio::Seq object or void if end of file Args : \s-1NONE\s0 Note : This function is only included for convention. It calls \fBget_next_seq()\fR. See \fBget_next_seq()\fR for details. .PP \fIget_next_hap\fR .IX Subsection "get_next_hap" .PP Title : get_next_hap Usage : \f(CW$hap\fR = \f(CW$stream\fR\->\fBget_next_hap()\fR Function: reads and returns the next sequence (haplotype) in the stream. Returns undef if all sequences in stream have been read. Returns : Haplotype string (e.g. '110110000101101045454000101') Args : \s-1NONE\s0 Note : Use \fBget_next_seq()\fR if you want the haplotype returned as a Bio::Seq object. .PP \fIget_next_pop\fR .IX Subsection "get_next_pop" .PP Title : get_next_pop Usage : \f(CW@seqs\fR = \f(CW$stream\fR\->\fBget_next_pop()\fR Function: reads and returns all the remaining sequences (haplotypes) in the population of the next sequence. Returns an empty list if no more haps remain to be read in the stream. Returns : array of Bio::Seq objects Args : \s-1NONE\s0 .PP \fInext_run\fR .IX Subsection "next_run" .PP Title : next_run Usage : \f(CW@seqs\fR = \f(CW$stream\fR\->\fBnext_run()\fR Function: reads and returns all the remaining sequences (haplotypes) in the ms run of the next sequence. Returns an empty list if all haps have been read from the stream. 
Returns : array of Bio::Seq objects Args : \s-1NONE\s0 .SS "Methods to Retrieve Constants" .IX Subsection "Methods to Retrieve Constants" \fIbase_conversion_table\fR .IX Subsection "base_conversion_table" .PP Title : get_base_conversion_table Usage : \f(CW$table_hash_ref\fR = \f(CW$stream\fR\->\fBget_base_conversion_table()\fR Function: returns a reference to a hash. The keys of the hash are the letters 'A','T','G','C'. The value associated with each key is the number that the corresponding letter in the sequence of a seq object returned by a Bio::SeqIO::msout stream should be translated to. Returns : reference to a hash Args : \s-1NONE\s0 Synopsis: .PP .Vb 3 \& # retrieve the conversion table and the Bio::Seq object\*(Aqs sequence \& my $rh_base_conversion_table = $stream\->get_base_conversion_table(); \& my $haplotype = $seq\->seq; \& \& # need to convert all letters to their corresponding numbers. \& foreach my $base (keys %{$rh_base_conversion_table}){ \& $haplotype =~ s/($base)/$rh_base_conversion_table\->{$base}/g; \& } \& \& # $haplotype is now an ms style haplotype. (e.g. \*(Aq100101101455\*(Aq) .Ve
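.PP
The decoding step can also be tried without an msout file. The following is a minimal sketch, not the module's implementation: decode_haplotype is a hypothetical helper, and %assumed_table is only a stand-in for the hash returned by get_base_conversion_table(). Its letter-to-number assignments are an assumption made for illustration, so check them against the table the stream actually returns.

```perl
use strict;
use warnings;

# ASSUMPTION: stand-in for the hash reference returned by
# $stream->get_base_conversion_table(); the real assignments may differ.
my %assumed_table = ( A => 0, T => 1, C => 4, G => 5 );

# Hypothetical helper: translate every coded letter in a single regex pass.
sub decode_haplotype {
    my ( $coded, $table ) = @_;

    # Build an alternation of the table's keys, e.g. "A|C|G|T".
    my $alternation = join '|', map { quotemeta($_) } sort keys %{$table};

    ( my $decoded = $coded ) =~ s/($alternation)/$table->{$1}/g;
    return $decoded;
}

# Made-up coded sequence, decoded with the assumed table.
print decode_haplotype( 'TAATATTATCGG', \%assumed_table ), "\n";
```

A single substitution over all keys avoids looping over the table once per letter. With a table whose values are never themselves keys, as assumed here, both approaches give the same result.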