.\" Automatically generated by Pod::Man 4.11 (Pod::Simple 3.35)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings.  \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote.  \*(C+ will
.\" give a nicer C++.  Capital omega is used to do unbreakable dashes and
.\" therefore won't be available.  \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
.    ds -- \(*W-
.    ds PI pi
.    if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
.    if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\"  diablo 12 pitch
.    ds L" ""
.    ds R" ""
.    ds C` ""
.    ds C' ""
'br\}
.el\{\
.    ds -- \|\(em\|
.    ds PI \(*p
.    ds L" ``
.    ds R" ''
.    ds C`
.    ds C'
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el       .ds Aq '
.\"
.\" If the F register is >0, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD.  Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.\"
.\" Avoid warning from groff about undefined register 'F'.
.de IX
..
.nr rF 0
.if \n(.g .if rF .nr rF 1
.if (\n(rF:(\n(.g==0)) \{\
.    if \nF \{\
.        de IX
.        tm Index:\\$1\t\\n%\t"\\$2"
..
.        if !\nF==2 \{\
.            nr % 0
.            nr F 2
.        \}
.    \}
.\}
.rr rF
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear.  Run.  Save yourself.  No user-serviceable parts.
.    \" fudge factors for nroff and troff
.if n \{\
.    ds #H 0
.    ds #V .8m
.    ds #F .3m
.    ds #[ \f1
.    ds #] \fP
.\}
.if t \{\
.    ds #H ((1u-(\\\\n(.fu%2u))*.13m)
.    ds #V .6m
.    ds #F 0
.    ds #[ \&
.    ds #] \&
.\}
.    \" simple accents for nroff and troff
.if n \{\
.    ds ' \&
.    ds ` \&
.    ds ^ \&
.    ds , \&
.    ds ~ ~
.    ds /
.\}
.if t \{\
.    ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
.    ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
.    ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
.    ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
.    ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
.    ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
.    \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
.    \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
.    \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
.    ds : e
.    ds 8 ss
.    ds o a
.    ds d- d\h'-1'\(ga
.    ds D- D\h'-1'\(hy
.    ds th \o'bp'
.    ds Th \o'LP'
.    ds ae ae
.    ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "NCBI-SEG 1"
.TH NCBI-SEG 1 "2020-07-21" "0.0.20000620" "User Commands"
.\" For nroff, turn off justification.  Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
ncbi\-seg \- segment sequence(s) by local complexity
.SH "SYNOPSIS"
.IX Header "SYNOPSIS"
ncbi-seg sequence [ W ] [ K(1) ] [ K(2) ] [ \-x ] [ options ]
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
ncbi-seg divides sequences into contrasting segments of low
complexity and high complexity.  Low-complexity segments defined by the
algorithm represent \*(L"simple sequences\*(R" or \*(L"compositionally-biased
regions\*(R".
.PP
Locally-optimized low-complexity segments are produced at defined
levels of stringency, based on formal definitions of local
compositional complexity (Wootton & Federhen, 1993).  The segment
lengths and the number of segments per sequence are determined
automatically by the algorithm.
.PP
The input is a FASTA-formatted sequence file, or a database file
containing many FASTA-formatted sequences.  ncbi-seg is tuned for amino
acid sequences.  For nucleotide sequences, see \s-1EXAMPLES OF
PARAMETER SETS\s0 below.
.PP
The stringency of the search for low-complexity segments is
determined by three user-defined parameters: trigger window length
[ W ], trigger complexity [ K(1) ] and extension complexity [ K(2) ]
(see \s-1PARAMETERS\s0 below).  The defaults provided are suitable
for low-complexity masking of database search query sequences [ \-x
option required, see below ].
.SH "OUTPUTS AND APPLICATIONS"
.IX Header "OUTPUTS AND APPLICATIONS"
(1) Readable segmented sequence [Default].  Regions of contrasting
complexity are displayed in \*(L"tree format\*(R".  See \s-1EXAMPLES.\s0
.PP
(2) Low-complexity masking (see Altschul et al., 1994).  Produce a
masked FASTA-formatted file, ready for input as a query sequence for
database search programs such as \s-1BLAST\s0 or \s-1FASTA.\s0  The amino acids in
low-complexity regions are replaced with \*(L"x\*(R" characters [\-x option].
See \s-1EXAMPLES.\s0
.PP
(3) Database construction.  Produce FASTA-formatted files containing
low-complexity segments [\-l option], or high-complexity segments
[\-h option], or both [\-a option].  Each segment is a separate
sequence entry with an informative header line.
.SH "ALGORITHM"
.IX Header "ALGORITHM"
The \s-1SEG\s0 algorithm has two stages: first, identification of
approximate raw segments of low complexity; second, local
optimization.
.PP
At the first stage, the stringency and resolution of the search for
low-complexity segments are determined by the W, K(1) and K(2)
parameters.  All windows of length W and complexity less than or
equal to K(1), including overlapping windows, are identified as
trigger windows.
\&\*(L"Complexity\*(R" here is defined by equation (3) of Wootton & Federhen
(1993).  Each trigger window is then extended into a contig in both
directions by merging with extension windows, which are overlapping
windows of length W and complexity less than or equal to K(2).
Each contig is a raw segment.
.PP
At the second stage, each raw segment is reduced to a single
optimal low-complexity segment, which may be the entire raw
segment but is usually a subsequence.  The optimal subsequence has
the lowest value of the probability P(0) (equation (5) of Wootton
& Federhen, 1993).
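.PP
As an illustration only, the first stage can be sketched in Python.
Equation (3) is not reproduced in this manual page, so the sketch
assumes that complexity is the Shannon entropy of the residue
composition of a window, in bits, which is consistent with the stated
maximum of log[base 2]20 for amino acids.  The function names are
hypothetical, K(2) is assumed to be at least K(1), and the second
(optimization) stage is not attempted.
.PP
.Vb 28
```python
import math
from collections import Counter

def complexity(window):
    """Shannon entropy of the residue composition, in bits (assumed)."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())

def raw_segments(seq, w=12, k1=2.2, k2=2.5):
    """Return raw segments as 0-based, end-exclusive (start, end) pairs."""
    k = [complexity(seq[i:i + w]) for i in range(len(seq) - w + 1)]
    segments = []
    i = 0
    while i < len(k):
        if k[i] > k2:
            i += 1
            continue
        # Find the run of consecutive windows with complexity <= K2.
        j = i
        while j < len(k) and k[j] <= k2:
            j += 1
        # The run is a contig only if it holds a trigger window (<= K1).
        if any(x <= k1 for x in k[i:j]):
            segments.append((i, j - 1 + w))
        i = j
    return segments
```
.Ve
.PP
Under this assumption a homopolymeric window has complexity 0 and a
window of 20 different residues has complexity log[base 2]20.  The
segments returned correspond to raw segments only and may differ from
those reported by ncbi-seg.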
.SH "PARAMETERS"
.IX Header "PARAMETERS"
These three numeric parameters are in obligatory order after the
sequence file name.  K1 and K2 below are the K(1) and K(2) of the
\s-1SYNOPSIS\s0.
.PP
Trigger window length [ W ].  An integer greater than zero [ Default
12 ].
.PP
Trigger complexity [ K1 ].  The maximum complexity of a trigger
window in units of bits.  K1 must be equal to or greater than zero.
The maximum value is 4.322 (log[base 2]20) for amino acid
sequences [ Default 2.2 ].
.PP
Extension complexity [ K2 ].  The maximum complexity of an extension
window in units of bits.  Only values greater than K1 are effective
in extending triggered windows.  Range of possible values is as for
K1 [ Default 2.5 ].
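.PP
The ranges above can be checked before a run.  The following is a
minimal sketch in Python; check_parameters and alphabet_size are
hypothetical names and are not part of ncbi-seg.  An alphabet size of
4 gives the nucleotide maximum of 2 bits.
.PP
.Vb 14
```python
import math

def check_parameters(w, k1, k2, alphabet_size=20):
    """Return a list of range errors for a W, K1, K2 parameter set."""
    k_max = math.log2(alphabet_size)  # about 4.322 for 20, 2.0 for 4
    errors = []
    if not isinstance(w, int) or w <= 0:
        errors.append("W must be an integer greater than zero")
    for name, value in (("K1", k1), ("K2", k2)):
        if not 0 <= value <= k_max:
            errors.append(name + " must lie in 0 to " + format(k_max, ".3f"))
    return errors
```
.Ve
.PP
An empty list means that the three values are within the documented
ranges; it does not indicate that they suit a particular analysis.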
.SH "OPTIONS"
.IX Header "OPTIONS"
The following options may be placed in any order in the command
line after the W, K1 and K2 parameters:
.IP "\-a" 4
.IX Item "-a"
Output both low-complexity and high-complexity segments in a
FASTA-formatted file, as a set of separate entries with header
lines.
.IP "\-c  [characters\-per\-line]" 4
.IX Item "-c [characters-per-line]"
Number of sequence characters per line of
output [Default 60].  Other characters, such as residue numbers, are additional.
.IP "\-h" 4
.IX Item "-h"
Output only the high-complexity segments in a FASTA-formatted
file, as a set of separate entries with header lines.
.IP "\-l" 4
.IX Item "-l"
Output only the low-complexity segments in a FASTA-formatted
file, as a set of separate entries with header lines.
.IP "\-m  [length]" 4
.IX Item "-m [length]"
Minimum length in residues for a high-complexity
segment [default 0].  Shorter segments are merged with adjacent
low-complexity segments.
.IP "\-o" 4
.IX Item "-o"
Show all overlapping, independently-triggered low-complexity
segments [these are merged by default].
.IP "\-q" 4
.IX Item "-q"
Produce an output format with the sequence in a numbered block
with markings to assist residue counting.  The low-complexity and
high-complexity segments are in lower\- and upper-case characters
respectively.
.IP "\-t  [length]" 4
.IX Item "-t [length]"
\&\*(L"Maximum trim length\*(R" parameter [default 100].  This
controls the search space (and search time) during the
optimization of raw segments (see \s-1ALGORITHM\s0 above).  By default,
subsequences 100 or more residues shorter than the raw segment are
omitted from the search.  This parameter may be increased to give
a more extensive search if raw segments are longer than 100 residues.
.IP "\-x" 4
.IX Item "-x"
The masking option for amino acid sequences.  Each input
sequence is represented by a single output sequence in FASTA-format
with low-complexity regions replaced by strings of \*(L"x\*(R" characters.
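.PP
Because the numeric parameters are positional and the options follow
them, a calling script may find it convenient to assemble the command
line in one place.  The following Python sketch is an illustration
only; build_command is a hypothetical helper, and it merely builds an
argument list without running ncbi-seg.
.PP
.Vb 13
```python
def build_command(sequence_file, w=None, k1=None, k2=None, options=()):
    """Build an argument list: file, then W, K1, K2, then options."""
    values = [w, k1, k2]
    given = [v for v in values if v is not None]
    # Positional parameters cannot be skipped: K1 needs W, K2 needs K1.
    if values[:len(given)] != given:
        raise ValueError("W, K1 and K2 must be given in order")
    args = ["ncbi-seg", sequence_file]
    args.extend(str(v) for v in given)
    args.extend(options)
    return args
```
.Ve
.PP
The resulting list can be passed to a process launcher such as
subprocess.run from the Python standard library.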
.SH "EXAMPLES OF PARAMETER SETS"
.IX Header "EXAMPLES OF PARAMETER SETS"
Default parameters are given by 'ncbi\-seg sequence' (equivalent to 'ncbi\-seg
sequence 12 2.2 2.5').  These parameters are appropriate for
low-complexity masking of many amino acid sequences [with \-x option].
.SS "Database-database comparisons:"
.IX Subsection "Database-database comparisons:"
More stringent (lower) complexity parameters are suitable when
masked sequences are compared with masked sequences.  For example,
for \s-1BLAST\s0 or \s-1FASTA\s0 searches that compare two amino acid sequence
databases, the following masking may be applied to both databases:
.PP
.Vb 1
\&  ncbi\-seg database 12 1.8 2.0 \-x
.Ve
.SS "Homopolymer analysis:"
.IX Subsection "Homopolymer analysis:"
To examine all homopolymeric subsequences of length (for example)
7 or greater:
.PP
.Vb 1
\&  ncbi\-seg sequence 7 0 0
.Ve
.SS "Non-globular regions of protein sequences:"
.IX Subsection "Non-globular regions of protein sequences:"
Many long non-globular domains may be diagnosed at longer window
lengths, typically:
.PP
.Vb 1
\&  ncbi\-seg sequence 45 3.4 3.75
.Ve
.PP
For some shorter non-globular domains, the following set is
appropriate:
.PP
.Vb 1
\&  ncbi\-seg sequence 25 3.0 3.3
.Ve
.SS "Nucleotide sequences:"
.IX Subsection "Nucleotide sequences:"
The maximum value of the complexity parameters is 2 (log[base 2]4).
For masking, the following is approximately equivalent in effect
to the default parameters for amino acid sequences:
.PP
.Vb 1
\&  ncbi\-seg sequence.na 21 1.4 1.6
.Ve
.SH "EXAMPLES"
.IX Header "EXAMPLES"
The following is a file named 'prion' in \s-1FASTA\s0 format:
.PP
.Vb 6
\& >PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
\& MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP
\& HGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGA
\& VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
\& NITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV
\& ILLISFLIFLIVG
.Ve
.PP
The command line:
.PP
.Vb 1
\& ncbi\-seg /usr/share/doc/ncbi\-seg/examples/prion.fa
.Ve
.PP
gives the standard output below:
.PP
.Vb 1
\& >PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
\&
\&                                   1\-49   MANLGCWMLVLFVATWSDLGLCKKRPKPGG
\&                                          WNTGGSRYPGQGSPGGNRY
\& ppqggggwgqphgggwgqphgggwgqphgg   50\-94
\&                gwgqphgggwgqggg
\&                                  95\-112  THSQWNKPSKPKTNMKHM
\&        agaaaagavvgglggymlgsams  113\-135
\&                                 136\-187  RPIIHFGSDYEDRYYRENMHRYPNQVYYRP
\&                                          MDEYSNQNNFVHDCVNITIKQH
\&                 tvttttkgenftet  188\-201
\&                                 202\-236  DVKMMERVVEQMCITQYERESQAYYQRGSS
\&                                          MVLFS
\&               sppvillisflifliv  237\-252
\&                                 253\-253  G
.Ve
.PP
The low-complexity sequences are on the left (lower case) and
high-complexity sequences are on the right (upper case).  All
sequence segments read from left to right and their order in the
sequence is from top to bottom, as shown by the central column of
residue numbers.
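.PP
The layout just described can be read back into a list of segments.
The following Python sketch is an illustration only and relies solely
on the layout shown in this example: a residue range in the central
column, lower-case residues for low complexity, and continuation lines
that carry no range.  parse_tree is a hypothetical name.
.PP
.Vb 24
```python
import re

# A residue range such as 50-94 marks the central column.
RANGE = re.compile("([0-9]+)-([0-9]+)")

def parse_tree(text):
    """Return [first, last, kind, residues] for each segment."""
    segments = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith(">"):
            continue
        match = RANGE.search(line)
        if match:
            residues = (line[:match.start()] + line[match.end():]).strip()
            kind = "low" if residues.islower() else "high"
            first, last = int(match.group(1)), int(match.group(2))
            segments.append([first, last, kind, residues])
        elif segments:
            # Continuation line: more residues of the current segment.
            segments[-1][3] += line.strip()
    return segments
```
.Ve
.PP
Each entry holds the first and last residue numbers, the kind of
segment, and the residues joined across continuation lines.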
.PP
The command line:
.PP
.Vb 1
\&  ncbi\-seg /usr/share/doc/ncbi\-seg/examples/prion.fa \-x
.Ve
.PP
gives the following FASTA-formatted file:
.PP
.Vb 6
\& >PRIO_HUMAN MAJOR PRION PROTEIN PRECURSOR
\& MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYxxxxxxxxxxx
\& xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxTHSQWNKPSKPKTNMKHMxxxxxxxx
\& xxxxxxxxxxxxxxxRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV
\& NITIKQHxxxxxxxxxxxxxxDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSxxxx
\& xxxxxxxxxxxxG
.Ve
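.PP
The relationship between the two outputs can be sketched in Python:
replacing the low-complexity residue ranges of the tree output with
\*(L"x\*(R" characters gives the masked sequence.  This is an illustration
only, not the code used by ncbi-seg; mask and wrap are hypothetical
helpers, and the ranges are copied from the tree output shown earlier
in this section.
.PP
.Vb 22
```python
PRION = (
    "MANLGCWMLVLFVATWSDLGLCKKRPKPGGWNTGGSRYPGQGSPGGNRYPPQGGGGWGQP"
    "HGGGWGQPHGGGWGQPHGGGWGQPHGGGWGQGGGTHSQWNKPSKPKTNMKHMAGAAAAGA"
    "VVGGLGGYMLGSAMSRPIIHFGSDYEDRYYRENMHRYPNQVYYRPMDEYSNQNNFVHDCV"
    "NITIKQHTVTTTTKGENFTETDVKMMERVVEQMCITQYERESQAYYQRGSSMVLFSSPPV"
    "ILLISFLIFLIVG"
)
# Low-complexity ranges from the tree output (1-based, inclusive).
LOW = [(50, 94), (113, 135), (188, 201), (237, 252)]

def mask(seq, spans, char="x"):
    """Replace each 1-based inclusive span of seq with char."""
    residues = list(seq)
    for first, last in spans:
        residues[first - 1:last] = char * (last - first + 1)
    return "".join(residues)

def wrap(seq, width=60):
    """Split seq into lines of at most width characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]
```
.Ve
.PP
Calling wrap(mask(\s-1PRION\s0, \s-1LOW\s0)) yields five lines of sequence that can
be compared with the masked file above.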
.SH "SEE ALSO"
.IX Header "SEE ALSO"
\&\fBsegn\fR\|(1), \fBblast\fR\|(1), \fBsaps\fR\|(1), \fBxnu\fR\|(1)
.SH "AUTHORS"
.IX Header "AUTHORS"
John Wootton:     wootton@ncbi.nlm.nih.gov
.PP
Scott Federhen:   federhen@ncbi.nlm.nih.gov
.PP
.Vb 6
\& National Center for Biotechnology Information
\& Building 38A, Room 8N805
\& National Library of Medicine
\& National Institutes of Health
\& Bethesda, MD 20894
\& U.S.A.
.Ve
.SH "PRIMARY REFERENCE"
.IX Header "PRIMARY REFERENCE"
Wootton, J.C., Federhen, S. (1993)  Statistics of local complexity
in amino acid sequences and sequence databases.  Computers &
Chemistry 17: 149\-163.
.SH "OTHER REFERENCES"
.IX Header "OTHER REFERENCES"
Wootton, J.C. (1994)  Non-globular domains in protein sequences:
automated segmentation using complexity measures.  Computers &
Chemistry 18: (in press).
.PP
Altschul, S.F., Boguski, M., Gish, W., Wootton, J.C. (1994)  Issues
in searching molecular sequence databases.  Nature Genetics 6:
119\-129.
.PP
Wootton, J.C. (1994)  Simple sequences of protein and \s-1DNA.\s0  In:
Nucleic Acid and Protein Sequence Analysis: A Practical Approach.
(Second Edition, Chapter 8, Bishop, M.J. and Rawlings, C.R. Eds.
\&\s-1IRL\s0 Press, Oxford) (In press).