<html>
<head>
<title>The Statistics of Sequence Similarity Scores</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<META NAME="keywords" CONTENT="sequence analysis, BLAST, Altschul, Cold Spring Harbor, statistics, sequence similarity">
<META NAME="description" CONTENT="A tutorial on the statistics of sequence similarity scores: global and local alignment statistics, bit scores, P-values, database searches, substitution matrices, and gap scores.">
<link rel="stylesheet" href="./ncbi.css">
</head>
<body bgcolor="#FFFFFF" background="GIFS/bkgd.gif" alt="" text="#000000" link="#000099" vlink="#6666CC">
|
|
|
|
<span class="TEXT"> <!-- the header -->
|
|
|
|
<table border="0" width="600" cellspacing="0" cellpadding="0">
|
|
|
|
<tr>
|
|
|
|
<td width="140"><a href="https://www.ncbi.nlm.nih.gov"> <img src="GIFS/left.GIF" alt="NCBI" width="130" height="45" border="0"></a></td>
|
|
|
|
<td width="360" class="head1" valign="BOTTOM"> <span class="H1">The Statistics of Sequence Similarity Scores</span></td>
|
|
|
|
<td width="100" valign="MIDDLE"><A HREF="http://www.cshl.org/"><IMG SRC="GIFS/CSH.gif" ALT="CSH" ALIGN=BOTTOM WIDTH="45" HEIGHT="45" BORDER="0"></A></td>
|
|
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
<!-- the quicklinks bar -->
|
|
|
|
<table CLASS="TEXT" border="0" width="600" cellspacing="0" cellpadding="3" bgcolor="#000000">
|
|
|
|
<tr CLASS="TEXT" align="CENTER">
|
|
|
|
<td width="170"><a href="Altschul-1.html" class="BAR">The statistics of <BR>sequence similarity scores</a></td>
|
|
|
|
<td width="170"><a href="Altschul-3.html" class="BAR">The statistics of <BR>PSI-BLAST scores</a></td>
|
|
|
|
<td width="170"><a href="Altschul-2.html" class="BAR">Iterated profile searches <BR>with PSI-BLAST</a></td>
|
|
|
|
<td width="90"><a href="https://blast.ncbi.nlm.nih.gov/" class="BAR">BLAST<BR>Home</a></td>
|
|
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
<!-- the contents -->
|
|
|
|
<table border="0" width="600" cellspacing="0" cellpadding="0">
|
|
|
|
<tr valign="TOP"> <!-- left column -->
|
|
|
|
<td width="125">
|
|
|
|
<p> </p>
|
|
<span class="GUTTER1"><a href="#head1" class="GUTTER">The statistics of global sequence comparison</a></span><BR><BR>
<span class="GUTTER1"><a href="#head2" class="GUTTER">The statistics of local sequence comparison</a></span><BR><BR>
<span class="GUTTER1"><a href="#head3" class="GUTTER">Bit scores</a></span><BR><BR>
<span class="GUTTER1"><a href="#head4" class="GUTTER">P-values</a></span><BR><BR>
<span class="GUTTER1"><a href="#head5" class="GUTTER">Database searches</a></span><BR><BR>
<span class="GUTTER1"><a href="#head6" class="GUTTER">The statistics of gapped alignments</a></span><BR><BR>
<span class="GUTTER1"><a href="#head7" class="GUTTER">Edge effects</a></span><BR><BR>
<span class="GUTTER1"><a href="#head8" class="GUTTER">The choice of substitution scores</a></span><BR><BR>
<span class="GUTTER1"><a href="#head9" class="GUTTER">The PAM and BLOSUM amino acid substitution matrices</a></span><BR><BR>
<span class="GUTTER1"><a href="#head10" class="GUTTER">DNA substitution matrices</a></span><BR><BR>
<span class="GUTTER1"><a href="#head11" class="GUTTER">Gap scores</a></span><BR><BR>
<span class="GUTTER1"><a href="#head12" class="GUTTER">Low complexity sequence regions</a></span><BR><BR>
<span class="GUTTER1"><a href="#refs" class="GUTTER">References</a></span><BR><BR>
|
|
|
|
|
|
</td>
|
|
|
|
<!-- extra column to force things over the gif border -->
|
|
|
|
<td width="15"> </td>
|
|
|
|
<!-- right content column -->
|
|
|
|
<td class="TEXTWIDE" width="460">
|
|
|
|
<p> </p>
|
|
|
|
<!-- title with bullet -->
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14">Introduction</h3>
|
|
|
|
<!-- end of title with bullet -->
|
|
|
|
<SPAN CLASS=TEXTWIDE>
|
|
|
|
To assess whether a given alignment constitutes evidence for homology, it
|
|
helps to know how strong an alignment can be expected from chance alone.
|
|
In this context, "chance" can mean the comparison of (i) real but non-homologous sequences; (ii) real sequences that are shuffled to preserve
|
|
compositional properties <A HREF="#ref1">[1-3]</A>; or (iii) sequences that are generated
|
|
randomly based upon a DNA or protein sequence model. Analytic statistical
|
|
results invariably use the last of these definitions of chance, while
|
|
empirical results based on simulation and curve-fitting may use any of
|
|
the definitions.<BR>
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head1">The statistics of global sequence comparison</A></h3>
|
|
|
|
Unfortunately, under even the simplest random models and scoring systems,
very little is known about the random distribution of optimal global
alignment scores <A HREF="#ref4">[4]</A>. Monte Carlo experiments can provide rough
distributional results for some specific scoring systems and sequence
compositions <A HREF="#ref5">[5]</A>, but these cannot be generalized easily. Therefore,
one of the few methods available for assessing the statistical significance
of a particular global alignment is to generate many random sequence
pairs of the appropriate length and composition, and calculate the
optimal alignment score for each <A HREF="#ref1">[1,3]</A>. While it is then possible to
express the score of interest in terms of standard deviations from the
mean, it is a mistake to assume that the relevant distribution is normal
and convert this <I>Z</I>-value into a <I>P</I>-value; the tail behavior of global
alignment scores is unknown. The most one can say reliably is that if
100 random alignments all score lower than the alignment of interest,
the <I>P</I>-value in question is likely less than 0.01. One further pitfall
to avoid is exaggerating the significance of a result found among multiple
tests. When many alignments have been generated, e.g. in a database
search, the significance of the best must be discounted accordingly.
An alignment with <I>P</I>-value 0.0001 in the context of a single trial may
be assigned a <I>P</I>-value of only 0.1 if it was selected as the best among
1000 independent trials.<BR>
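The discounting for multiple tests described above can be made concrete. The following is a minimal Python sketch, assuming the trials are independent; the function name best_of_n_p_value is ours and is used for illustration only.<BR>
<PRE>
```python
def best_of_n_p_value(p_single: float, n_trials: int) -> float:
    """P-value of the best score among n independent trials.

    The chance that no trial reaches the score is (1 - p)^n, so the
    chance that at least one trial does is the complement.
    """
    return 1.0 - (1.0 - p_single) ** n_trials


# The example from the text: P = 0.0001 in one trial, best of 1000 trials.
corrected = best_of_n_p_value(0.0001, 1000)
print(f"corrected P-value: {corrected:.3f}")
```
</PRE>
For small <I>P</I>-values the result is close to the single-trial <I>P</I>-value multiplied by the number of trials, which is the figure of about 0.1 quoted above.<BR>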
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head2">The statistics of local sequence comparison</A></h3>
|
|
|
|
Fortunately, the statistics for the scores of local alignments, unlike those of
global alignments, are well understood. This is particularly true for local
alignments lacking gaps, which we will consider first. Such alignments were
precisely those sought by the original BLAST database search programs <A HREF="#ref6">[6]</A>.<BR>
|
|
|
|
A local alignment without gaps consists simply of a pair of equal-length
segments, one from each of the two sequences being compared. A modification
of the Smith-Waterman <A HREF="#ref7">[7]</A> or Sellers <A HREF="#ref8">[8]</A> algorithm will find all segment
pairs whose scores cannot be improved by extension or trimming. These are
called high-scoring segment pairs or HSPs.<BR>
|
|
|
|
To analyze how high a score is likely to arise by chance, a model of random
sequences is needed. For proteins, the simplest model chooses the amino acid
residues in a sequence independently, with specific background probabilities
for the various residues. Additionally, the expected score for aligning a
random pair of amino acids is required to be negative. Were this not the case,
long alignments would tend to have high scores independently of whether the
segments aligned were related, and the statistical theory would break down.<BR>
|
|
|
|
Just as the sum of a large number of independent identically distributed
(i.i.d.) random variables tends to a normal distribution, the maximum
of a large number of i.i.d. random variables tends to an extreme value
distribution <A HREF="#ref9">[9]</A>. (We will elide the many technical points required
to make this statement rigorous.) In studying optimal local sequence
alignments, we are essentially dealing with the latter case <A HREF="#ref10">[10,11]</A>.
In the limit of sufficiently large sequence lengths <I>m</I> and <I>n</I>, the
statistics of HSP scores are characterized by two parameters, <I>K</I> and
<I>lambda</I>. Most simply, the expected number of HSPs with score at least
<I>S</I> is given by the formula<BR>
|
|
|
|
<IMG SRC="GIFS/(1).gif" alt="Formula 1: E = K m n e^(-lambda S)" WIDTH="460" HEIGHT="50" BORDER="0"><BR><BR><BR><BR>
|
|
|
|
We call this the <I>E</I>-value for the score <I>S</I>.<BR>
|
|
This formula makes eminently intuitive sense. Doubling the length of
|
|
either sequence should double the number of HSPs attaining a given score.
|
|
Also, for an HSP to attain the score <I>2x</I> it must attain the score <I>x</I> twice
|
|
in a row, so one expects <I>E</I> to decrease exponentially with score. The
|
|
parameters <I>K</I> and <I>lambda</I> can be thought of simply as natural scales for
|
|
the search space size and the scoring system respectively.<BR>
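As a rough illustration, the <I>E</I>-value can be computed directly. This Python sketch assumes that formula (1) has the standard form E = K m n e^(-lambda S); the function name and the values chosen for <I>K</I> and <I>lambda</I> are illustrative assumptions, not parameters taken from this text.<BR>
<PRE>
```python
import math


def e_value(raw_score: float, m: int, n: int, k: float, lam: float) -> float:
    """Expected number of chance HSPs with score at least raw_score.

    Assumes the form E = K * m * n * exp(-lambda * S).
    """
    return k * m * n * math.exp(-lam * raw_score)


# Illustrative parameter values only (assumed, not taken from the text).
K_PARAM, LAMBDA_PARAM = 0.134, 0.318

base = e_value(60, 250, 1_000_000, K_PARAM, LAMBDA_PARAM)
longer = e_value(60, 500, 1_000_000, K_PARAM, LAMBDA_PARAM)
higher = e_value(120, 250, 1_000_000, K_PARAM, LAMBDA_PARAM)

print(f"E at S=60:            {base:.3g}")
print(f"E with m doubled:     {longer:.3g}")
print(f"E with score doubled: {higher:.3g}")
```
</PRE>
Doubling a sequence length doubles <I>E</I>, while doubling the score shrinks it exponentially, matching the two intuitions given above.<BR>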
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head3">Bit scores</A></h3>
|
|
|
|
Raw scores have little meaning without detailed knowledge of the scoring
|
|
system used, or more simply its statistical parameters <I>K</I> and <I>lambda</I>.
|
|
Unless the scoring system is understood, citing a raw score alone is like citing a distance without specifying
|
|
feet, meters, or light years.
|
|
By normalizing a raw score using the formula<BR>
|
|
|
|
<IMG SRC="GIFS/(2).gif" alt="Formula 2: S' = (lambda S - ln K) / ln 2" ALIGN=BOTTOM WIDTH="460" HEIGHT="65" BORDER="0"><BR><BR><BR><BR>
|
|
|
|
one obtains a "bit score" <I>S'</I>, which has a standard set of units. The <I>E</I>-value
corresponding to a given bit score is simply<BR>
|
|
|
|
<IMG SRC="GIFS/(3).gif" alt="Formula 3: E = m n 2^(-S')" ALIGN=BOTTOM WIDTH="460" HEIGHT="65" BORDER="0"><BR><BR><BR><BR>
|
|
|
|
Bit scores subsume the statistical essence of the scoring system employed,
|
|
so that to calculate significance one needs to know in addition only the
|
|
size of the search space.<BR>
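The normalization can be sketched in a few lines of Python. The sketch assumes that formula (2) is S' = (lambda S - ln K) / ln 2 and that formula (3) is E = m n 2^(-S'); the function names and the parameter values are illustrative assumptions.<BR>
<PRE>
```python
import math


def bit_score(raw_score: float, k: float, lam: float) -> float:
    """Normalize a raw score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """E-value from a bit score; only the search space size m * n is needed."""
    return m * n * 2.0 ** (-bits)


# Illustrative parameter values only (assumed, not taken from the text).
K_PARAM, LAMBDA_PARAM = 0.134, 0.318
bits = bit_score(60, K_PARAM, LAMBDA_PARAM)
print(f"bit score: {bits:.1f}")
print(f"E-value:   {e_value_from_bits(bits, 250, 1_000_000):.3g}")
```
</PRE>
Under these assumptions the <I>E</I>-value obtained from the bit score agrees with the one obtained from the raw score together with <I>K</I> and <I>lambda</I>, which is the sense in which bit scores subsume the scoring system.<BR>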
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head4">P-values</A></h3>
|
|
|
|
The number of random HSPs with score >= <I>S</I> is described by a Poisson
|
|
distribution <A HREF="#ref10">[10,11]</A>. This means that the probability of finding exactly
|
|
<I>a</I> HSPs with score >=<I>S</I> is given by<BR>
|
|
|
|
<IMG SRC="GIFS/(4).gif" alt="Formula 4: e^(-E) E^a / a!" ALIGN=BOTTOM WIDTH="460" HEIGHT="65" BORDER="0"><BR><BR><BR><BR>
|
|
|
|
where <I>E</I> is the <I>E</I>-value of <I>S</I> given by equation (1) above. Specifically the
|
|
chance of finding zero HSPs with score >=<I>S</I> is e<SUP>-E</SUP>, so the probability
|
|
of finding at least one such HSP is<BR>
|
|
|
|
<IMG SRC="GIFS/(5).gif" alt="Formula 5: P = 1 - e^(-E)" ALIGN=BOTTOM WIDTH="460" HEIGHT="50" BORDER="0"><BR><BR><BR><BR>
|
|
|
|
This is the <I>P</I>-value associated with the score <I>S</I>. For example, if one expects
to find three HSPs with score >= <I>S</I>, the probability of finding at least one
is 0.95. The BLAST programs report <I>E</I>-values rather than <I>P</I>-values because it
is easier to understand the difference between, for example, <I>E</I>-values of 5
and 10 than <I>P</I>-values of 0.993 and 0.99995. However, when <I>E</I> &lt; 0.01, <I>P</I>-values
and <I>E</I>-values are nearly identical.<BR>
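The numbers quoted above follow directly from the Poisson distribution. A minimal Python sketch, assuming formula (4) is the usual Poisson probability e^(-E) E^a / a! and formula (5) is P = 1 - e^(-E); the function names are ours.<BR>
<PRE>
```python
import math


def prob_exactly(a: int, e: float) -> float:
    """Poisson probability of exactly a chance HSPs when E are expected."""
    return math.exp(-e) * e ** a / math.factorial(a)


def p_value(e: float) -> float:
    """Probability of at least one chance HSP: 1 - exp(-E)."""
    return -math.expm1(-e)  # expm1 keeps precision when E is tiny


for e in (0.001, 0.01, 3.0, 5.0, 10.0):
    print(f"E = {e:6}: P = {p_value(e):.5f}")
```
</PRE>
Using expm1 rather than subtracting exp(-E) from 1 is a small design choice that avoids loss of precision for the very small <I>E</I>-values that matter most in practice.<BR>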
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head5">Database searches</A></h3>
|
|
|
|
The <I>E</I>-value of equation (1) applies to the comparison of two proteins of
|
|
lengths <I>m</I> and <I>n</I>. How does one assess the significance of an alignment that
|
|
arises from the comparison of a protein of length <I>m</I> to a database containing
|
|
many different proteins, of varying lengths? One view is that all proteins
|
|
in the database are <I>a priori</I> equally likely to be related to the query.
|
|
This implies that a low <I>E</I>-value for an alignment involving a short database
|
|
sequence should carry the same weight as a low <I>E</I>-value for an alignment
|
|
involving a long database sequence. To calculate a "database search" <I>E</I>-value,
|
|
one simply multiplies the pairwise-comparison <I>E</I>-value by the number of
|
|
sequences in the database. Recent versions of the FASTA protein comparison
|
|
programs <A HREF="#ref12">[12]</A> take this approach <A HREF="#ref13">[13]</A>.<BR>
|
|
|
|
An alternative view is that a query is <I>a priori</I> more likely to be related to
|
|
a long than to a short sequence, because long sequences are often composed of
|
|
multiple distinct domains. If we assume the <I>a priori</I> chance of relatedness is
|
|
proportional to sequence length, then the pairwise <I>E</I>-value involving a database
|
|
sequence of length <I>n</I> should be multiplied by <I>N/n</I>, where <I>N</I> is the total length
|
|
of the database in residues. Examining equation (1), this can be accomplished
|
|
simply by treating the database as a single long sequence of length <I>N</I>. The
|
|
BLAST programs <A HREF="#ref6">[6,14,15]</A> take this approach to calculating database <I>E</I>-values.
|
|
Notice that for DNA sequence comparisons, the length of database records is
|
|
largely arbitrary, and therefore this is the only really tenable method for
|
|
estimating statistical significance.<BR>
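The two conventions can be compared side by side. In this Python sketch the function names and the database dimensions are hypothetical, chosen only to show how the two corrections differ for short and long database sequences.<BR>
<PRE>
```python
def db_e_value_per_sequence(pairwise_e: float, num_sequences: int) -> float:
    """First view: every database sequence is equally likely to be related."""
    return pairwise_e * num_sequences


def db_e_value_by_length(pairwise_e: float, subject_len: int,
                         db_total_len: int) -> float:
    """Second view: prior chance of relatedness is proportional to length,
    so the pairwise E-value is multiplied by N / n."""
    return pairwise_e * db_total_len / subject_len


# Hypothetical database: 1000 sequences, 300000 residues in total.
NUM_SEQS, TOTAL_LEN = 1000, 300_000
PAIRWISE_E = 1e-6

print("per-sequence scaling:", db_e_value_per_sequence(PAIRWISE_E, NUM_SEQS))
for subject_len in (150, 300, 600):
    scaled = db_e_value_by_length(PAIRWISE_E, subject_len, TOTAL_LEN)
    print(f"length scaling, n = {subject_len}: {scaled:.2e}")
```
</PRE>
For the same pairwise <I>E</I>-value, the two conventions agree for a database sequence of average length; the length-based view penalizes hits to short sequences more heavily and hits to long sequences less.<BR>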
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head6">The statistics of gapped alignments</A></h3>
|
|
|
|
The statistics developed above have a solid theoretical foundation only
|
|
for local alignments that are not permitted to have gaps. However, many
|
|
computational experiments <A HREF="#ref14">[14-21]</A> and some analytic results <A HREF="#ref22">[22]</A> strongly
|
|
suggest that the same theory applies as well to gapped alignments. For
|
|
ungapped alignments, the statistical parameters can be calculated, using
|
|
analytic formulas, from the substitution scores and the background residue frequencies of the sequences being compared. For gapped alignments,
|
|
these parameters must be estimated from a large-scale comparison of
|
|
"random" sequences.<BR>
|
|
|
|
Some database search programs, such as FASTA <A HREF="#ref12">[12]</A> or various implementations
of the Smith-Waterman algorithm <A HREF="#ref7">[7]</A>, produce optimal local alignment scores
for the comparison of the query sequence to every sequence in the database.
Most of these scores involve unrelated sequences, and therefore can be used
to estimate <I>lambda</I> and <I>K</I> <A HREF="#ref17">[17,21]</A>. This approach avoids the artificiality of
a random sequence model by employing real sequences, with their attendant
internal structure and correlations, but it must face the problem of excluding
from the estimation the scores from pairs of related sequences. The BLAST programs
achieve much of their speed by avoiding the calculation of optimal alignment
scores for all but a handful of unrelated sequences. They must therefore rely
upon a pre-estimation of the parameters <I>lambda</I> and <I>K</I>, for a selected set of
substitution matrices and gap costs. This estimation could be done using real
sequences, but has instead relied upon a random sequence model <A HREF="#ref14">[14]</A>, which
appears to yield fairly accurate results <A HREF="#ref21">[21]</A>.<BR>
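To give a feel for what estimating <I>lambda</I> and <I>K</I> from chance scores involves, here is a deliberately simplified Python sketch. It is not the procedure used by FASTA or BLAST: it assumes that optimal scores for sequences of fixed lengths <I>m</I> and <I>n</I> follow an extreme value distribution consistent with E = K m n e^(-lambda S), and it fits that distribution by the method of moments. The function names and the simulated data are our own assumptions.<BR>
<PRE>
```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_lambda_k(scores, m, n):
    """Method-of-moments fit of an extreme value distribution.

    For P(S >= x) = 1 - exp(-K m n exp(-lambda x)) the standard deviation is
    pi / (lambda sqrt(6)) and the mean is mu + gamma / lambda, where
    mu = ln(K m n) / lambda is the location parameter.
    """
    mean = statistics.fmean(scores)
    sd = statistics.stdev(scores)
    lam = math.pi / (sd * math.sqrt(6.0))
    mu = mean - EULER_GAMMA / lam
    k = math.exp(lam * mu) / (m * n)
    return lam, k


def simulate_scores(lam, k, m, n, count, seed=0):
    """Draw synthetic chance scores from the assumed distribution."""
    rng = random.Random(seed)
    mu = math.log(k * m * n) / lam
    return [mu - math.log(-math.log(rng.random())) / lam for _ in range(count)]


# Hypothetical "true" parameters, used only to generate test data.
scores = simulate_scores(lam=0.27, k=0.04, m=300, n=300, count=20000)
lam_hat, k_hat = fit_lambda_k(scores, 300, 300)
print(f"estimated lambda = {lam_hat:.3f}, estimated K = {k_hat:.3f}")
```
</PRE>
With real database scores the sequences differ in length and some pairs are genuinely related, so a practical estimator must account for both; this sketch ignores those complications.<BR>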
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head7">Edge effects</A></h3>
|
|
|
|
The statistics described above tend to be somewhat conservative for short
|
|
sequences. The theory supporting these statistics is an asymptotic one,
|
|
which assumes an optimal local alignment can begin with any aligned pair
|
|
of residues. However, a high-scoring alignment must have some length,
|
|
and therefore cannot begin near the end of either of the two sequences
|
|
being compared. This "edge effect" may be corrected for by calculating
|
|
an "effective length" for sequences <A HREF="#ref14">[14]</A>; the BLAST programs implement
|
|
such a correction. For sequences longer than about 200 residues the edge
|
|
effect correction is usually negligible.<BR>
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head8">The choice of substitution scores</A></h3>
|
|
|
|
The results a local alignment program produces depend strongly upon the
|
|
scores it uses. No single scoring scheme is best for all purposes, and
|
|
an understanding of the basic theory of local alignment scores can improve
|
|
the sensitivity of one's sequence analyses. As before, the theory is fully
|
|
developed only for scores used to find ungapped local alignments, so we
|
|
start with that case.<BR>
|
|
|
|
A large number of different amino acid substitution scores, based upon a
variety of rationales, have been described <A HREF="#ref23">[23-36]</A>. However, the scores of
any substitution matrix with negative expected score can be written uniquely
in the form<BR>
|
|
|
|
<IMG SRC="GIFS/(6).gif" alt="Formula 6: S_ij = ln(q_ij / (p_i p_j)) / lambda" ALIGN=BOTTOM WIDTH="460" HEIGHT="80" BORDER="0"><BR><BR><BR><BR><BR><BR>
|
|
|
|
where the <I>q<SUB>ij</SUB></I>, called target frequencies, are positive numbers that sum
|
|
to 1, the <I>p<SUB>i</SUB></I> are background frequencies for the various residues, and
|
|
<I>lambda</I> is a positive constant <A HREF="#ref10">[10,31]</A>. The <I>lambda</I> here is identical to the
|
|
<I>lambda</I> of equation (1).<BR>
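Because the target frequencies must sum to 1, <I>lambda</I> can be recovered numerically from any score matrix and set of background frequencies. The Python sketch below assumes formula (6) has the log-odds form S_ij = ln(q_ij / (p_i p_j)) / lambda, so that lambda is the positive solution of sum over i, j of p_i p_j e^(lambda S_ij) = 1. The function names, the bisection method, and the +1/-3 nucleotide scores with uniform background frequencies are illustrative assumptions.<BR>
<PRE>
```python
import math


def solve_lambda(scores, p):
    """Positive root of sum_ij p_i p_j exp(lambda * s_ij) = 1, by bisection.

    Requires a negative expected score and at least one positive score.
    """
    size = len(p)

    def excess(lam):
        return sum(p[i] * p[j] * math.exp(lam * scores[i][j])
                   for i in range(size) for j in range(size)) - 1.0

    lo, hi = 1e-6, 1.0
    while not excess(hi) > 0:      # expand until the root is bracketed
        hi *= 2.0
    for _ in range(100):           # halve the bracket repeatedly
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def target_frequencies(scores, p, lam):
    """Target frequencies implied by the scores: q_ij = p_i p_j exp(lambda s_ij)."""
    size = len(p)
    return [[p[i] * p[j] * math.exp(lam * scores[i][j]) for j in range(size)]
            for i in range(size)]


# Illustrative example: match +1, mismatch -3, uniform nucleotide frequencies.
p = [0.25, 0.25, 0.25, 0.25]
scores = [[1 if i == j else -3 for j in range(4)] for i in range(4)]
lam = solve_lambda(scores, p)
q = target_frequencies(scores, p, lam)
print(f"lambda = {lam:.3f}")
print(f"implied fraction of matches = {sum(q[i][i] for i in range(4)):.3f}")
```
</PRE>
The implied target frequencies show which class of alignments a given scoring system is tuned to find, which is the point developed in the following paragraphs.<BR>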
|
|
|
|
Multiplying all the scores in a substitution matrix by a positive constant
|
|
does not change their essence: an alignment that was optimal using the
|
|
original scores remains optimal. Such multiplication alters the parameter
|
|
<I>lambda</I> but not the target frequencies <I>q<SUB>ij</SUB></I>. Thus, up to a constant
|
|
scaling factor, every substitution matrix is uniquely determined by its
|
|
target frequencies. These frequencies have a special significance <A HREF="#ref10">[10,31]</A>:<BR><BR>
|
|
|
|
<CENTER><TABLE WIDTH=400>
|
|
<TR><TD ><SPAN CLASS=TEXTWIDE>
|
|
A given class of alignments is best distinguished from chance by the
|
|
substitution matrix whose target frequencies characterize the class.
|
|
</SPAN></TD></TR>
|
|
</TABLE></CENTER><BR>
|
|
|
|
To elaborate, one may characterize a set of alignments representing homologous
|
|
protein regions by the frequency with which each possible pair of residues is
|
|
aligned. If valine in the first sequence and leucine in the second appear in
|
|
1% of all alignment positions, the target frequency for (valine, leucine) is
|
|
0.01. The most direct way to construct appropriate substitution matrices for
|
|
local sequence comparison is to estimate target and background frequencies,
|
|
and calculate the corresponding log-odds scores of formula (6). These
|
|
frequencies in general cannot be derived from first principles, and their
|
|
estimation requires empirical input.<BR>
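The log-odds calculation itself is short. This Python sketch assumes formula (6) is S_ij = ln(q_ij / (p_i p_j)) / lambda. The target frequency of 0.01 for (valine, leucine) is the example from the text; the background frequencies and the half-bit scale are hypothetical values chosen for illustration.<BR>
<PRE>
```python
import math


def log_odds_score(q_ij: float, p_i: float, p_j: float, lam: float) -> float:
    """Score for aligning residues i and j: ln(q_ij / (p_i * p_j)) / lambda."""
    return math.log(q_ij / (p_i * p_j)) / lam


# lambda = ln(2) / 2 expresses scores in half-bit units (an assumed scale).
HALF_BIT_LAMBDA = math.log(2) / 2

# Hypothetical background frequencies for valine and leucine.
P_VAL, P_LEU = 0.07, 0.10

score = log_odds_score(0.01, P_VAL, P_LEU, HALF_BIT_LAMBDA)
print(f"valine-leucine score: {score:.2f} half-bits")
```
</PRE>
A pair that is aligned more often in related sequences than expected by chance (q_ij greater than p_i p_j) receives a positive score, and a pair aligned less often receives a negative one.<BR>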
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head9">The PAM and BLOSUM amino acid substitution matrices</A></h3>
|
|
|
|
While all substitution matrices are implicitly of log-odds form, the first
|
|
explicit construction using formula (6) was by Dayhoff and coworkers <A HREF="#ref24">[24,25]</A>. From a study of observed residue replacements in closely related proteins,
|
|
they constructed the PAM (for "point accepted mutation") model of molecular
|
|
evolution. One "PAM" corresponds to an average change in 1% of all amino
|
|
acid positions. After 100 PAMs of evolution, not every residue will have
|
|
changed: some will have mutated several times, perhaps returning to their
|
|
original state, and others not at all. Thus it is possible to recognize as
|
|
homologous proteins separated by much more than 100 PAMs. Note that there
|
|
is no general correspondence between PAM distance and evolutionary time, as
|
|
different protein families evolve at different rates.<BR>
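The observation that 100 PAMs does not mean that every residue has changed can be illustrated with a toy model. The Python sketch below is not Dayhoff's matrix: it assumes a hypothetical 1-PAM transition matrix in which each of 20 residues changes with probability 0.01 and is equally likely to become any of the other 19. Raising that matrix to the 100th power gives the substitution probabilities after 100 PAMs.<BR>
<PRE>
```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by squaring."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    base = matrix
    while power > 0:
        if power % 2 == 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power //= 2
    return result


def toy_pam1(num_residues=20, change=0.01):
    """Hypothetical uniform 1-PAM matrix: 1% chance of change per step."""
    off_diagonal = change / (num_residues - 1)
    return [[1.0 - change if i == j else off_diagonal
             for j in range(num_residues)] for i in range(num_residues)]


pam100 = mat_pow(toy_pam1(), 100)
print(f"probability a residue is unchanged after 100 PAMs: {pam100[0][0]:.2f}")
```
</PRE>
In this toy model well over a third of positions still carry their original residue after 100 PAMs. The real PAM model uses unequal, empirically estimated substitution probabilities, so its figures differ, but the same matrix-power extrapolation applies.<BR>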
|
|
|
|
Using the PAM model, the target frequencies and the corresponding substitution
matrix may be calculated for any given evolutionary distance. When two
sequences are compared, it is not generally known <I>a priori</I> what evolutionary
distance will best characterize any similarity they may share. Closely
related sequences, however, are relatively easy to find even with non-optimal
matrices, so the tendency has been to use matrices tailored for fairly distant
similarities. For many years, the most widely used matrix was PAM-250,
because it was the only one originally published by Dayhoff.<BR>
|
|
|
|
Dayhoff's formalism for calculating target frequencies has been criticized
|
|
<A HREF="#ref27">[27]</A>, and there have been several efforts to update her numbers using the
|
|
vast quantities of derived protein sequence data generated since her work
|
|
<A HREF="#ref33">[33,35]</A>. These newer PAM matrices do not differ greatly from the original
|
|
ones <A HREF="#ref37">[37]</A>.<BR>
|
|
|
|
An alternative approach to estimating target frequencies, and the corresponding
|
|
log-odds matrices, has been advanced by Henikoff & Henikoff <A HREF="#ref34">[34]</A>. They examine
|
|
multiple alignments of distantly related protein regions directly, rather than
|
|
extrapolate from closely related sequences. An advantage of this approach is
|
|
that it stays closer to observation; a disadvantage is that it yields no
|
|
evolutionary model. A number of tests <A HREF="#ref13">[13,37]</A> suggest that the "BLOSUM"
|
|
matrices produced by this method generally are superior to the PAM matrices
|
|
for detecting biological relationships.<BR>
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head10">DNA substitution matrices</A></h3>
|
|
|
|
While we have discussed substitution matrices only in the context of protein sequence comparison, all the main issues carry over to DNA sequence comparison.
|
|
One warning is that when the sequences of interest code for protein, it is almost always better to compare the protein translations than to compare the DNA sequences directly.
|
|
The reason is that after only a small amount of evolutionary change, the DNA sequences, when compared using simple nucleotide substitution scores, contain less
|
|
information with which to deduce homology than do the encoded protein sequences
|
|
<A HREF="#ref32">[32]</A>.<BR>
|
|
Sometimes, however, one may wish to compare non-coding DNA sequences, at which point the same log-odds approach as before applies.
|
|
An evolutionary model in which all nucleotides are equally common and all substitution mutations are equally likely yields different scores only for matches and mismatches <A HREF="#ref32">[32]</A>.
|
|
A more complex model, in which transitions are more likely than transversions, yields different "mismatch" scores for transitions and transversions <A HREF="#ref32">[32]</A>.
|
|
The best scores to use will depend upon whether one is seeking relatively diverged or closely related sequences <A HREF="#ref32">[32]</A>.<BR>
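The simplest DNA model can be worked through numerically. This Python sketch assumes uniform nucleotide frequencies, equally likely substitutions, and the log-odds form S_ij = ln(q_ij / (p_i p_j)) / lambda with lambda set to 1; the function name and the identity levels tried are illustrative assumptions.<BR>
<PRE>
```python
import math


def dna_log_odds(identity: float):
    """Match and mismatch log-odds scores (in nats) under a uniform model.

    identity is the fraction of aligned positions expected to match in the
    class of alignments being sought.
    """
    background = 0.25 * 0.25                 # p_i * p_j for any pair
    q_match = identity / 4.0                 # 4 equally likely matching pairs
    q_mismatch = (1.0 - identity) / 12.0     # 12 equally likely mismatches
    return (math.log(q_match / background),
            math.log(q_mismatch / background))


for identity in (0.75, 0.90, 0.99):
    match, mismatch = dna_log_odds(identity)
    print(f"identity {identity:.2f}: mismatch/match ratio = "
          f"{mismatch / match:.2f}")
```
</PRE>
Since only the ratio of mismatch to match score matters, the output shows that seeking closely related sequences calls for a much harsher mismatch penalty than seeking diverged ones.<BR>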
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head11">Gap scores</A></h3>
|
|
|
|
Our theoretical development concerning the optimality of matrices constructed
|
|
using equation (6) unfortunately is invalid as soon as gaps and associated gap
|
|
scores are introduced, and no more general theory is available to take its
|
|
place. However, if the gap scores employed are sufficiently large, one can
|
|
expect that the optimal substitution scores for a given application will not
|
|
change substantially. In practice, the same substitution scores have been
|
|
applied fruitfully to local alignments both with and without gaps. Appropriate
|
|
gap scores have been selected over the years by trial and error <A HREF="#ref13">[13]</A>, and most
|
|
alignment programs will have a default set of gap scores to go with a default
|
|
set of substitution scores. If the user wishes to employ a different set of
|
|
substitution scores, there is no guarantee that the same gap scores will remain
|
|
appropriate. No clear theoretical guidance can be given, but "affine gap
|
|
scores" <A HREF="#ref38">[38-41]</A>, with a large penalty for opening a gap and a much smaller
|
|
one for extending it, have generally proved among the most effective.<BR>
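The effect of affine gap scores is easy to see numerically. In this Python sketch the convention that a gap of length k costs open + k * extend, and the values 11 and 1, are assumptions made for illustration; programs differ in both the convention and the defaults.<BR>
<PRE>
```python
def affine_gap_cost(length: int, gap_open: int = 11, gap_extend: int = 1) -> int:
    """Cost of one gap of the given length under an assumed affine scheme.

    A gap of length k costs gap_open + k * gap_extend; no gap costs nothing.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * length


one_long_gap = affine_gap_cost(10)
ten_short_gaps = 10 * affine_gap_cost(1)
print(f"one gap of length 10:  {one_long_gap}")
print(f"ten gaps of length 1:  {ten_short_gaps}")
```
</PRE>
A single long gap is far cheaper than many short ones covering the same number of residues, so the alignment is discouraged from opening gaps freely but is not heavily penalized for the length of a gap once it is open.<BR>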
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name = "head12">Low complexity sequence regions</A></h3>
|
|
|
|
There is one frequent case where the random models and therefore the statistics
|
|
discussed here break down. As many as one fourth of all residues in protein
|
|
sequences occur within regions with highly biased amino acid composition.
|
|
Alignments of two regions with similarly biased composition may achieve very
|
|
high scores that owe virtually nothing to residue order but are due instead
|
|
to segment composition. Alignments of such "low complexity" regions have
|
|
little meaning in any case: since these regions most likely arise by gene
|
|
slippage, the one-to-one residue correspondence imposed by alignment is
|
|
not valid. While it is worth noting that two proteins contain similar low
|
|
complexity regions, they are best excluded when constructing alignments
|
|
<A HREF="#ref42">[42-44]</A>. The BLAST programs employ the SEG algorithm <A HREF="#ref43">[43]</A> to filter low
|
|
complexity regions from proteins before executing a database search.<BR>
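The idea behind such filtering can be sketched with a simple compositional measure. The following Python code is not the SEG algorithm: it is a simplified stand-in that masks any window whose Shannon entropy falls at or below a threshold. The function names, the window length of 12, the threshold of 2.2 bits, and the test sequence are all illustrative assumptions.<BR>
<PRE>
```python
import math
from collections import Counter


def window_entropy(segment: str) -> float:
    """Shannon entropy (bits per residue) of the composition of a segment."""
    total = len(segment)
    counts = Counter(segment)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12,
                        threshold: float = 2.2) -> str:
    """Replace residues in low-entropy windows with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if threshold >= window_entropy(seq[start:start + window]):
            for pos in range(start, start + window):
                masked[pos] = "X"
    return "".join(masked)


# Hypothetical sequence: a diverse region followed by a serine-rich run.
example = "ACDEFGHIKLMNPQRSTVWY" + "S" * 14
print(mask_low_complexity(example))
```
</PRE>
The masked residues no longer contribute positive scores, so a biased region cannot by itself produce a high-scoring alignment, while the rest of the sequence is searched as usual.<BR>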
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name = "refs">References</A></h3>
|
|
|
|
|
|
<A NAME="ref1">[1]</A> Fitch, W.M. (1983) "Random sequences." J. Mol. Biol. 163:171-176. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/6842586">(PubMed)</A><BR><BR>
|
|
<A NAME="ref2">[2]</A> Lipman, D.J., Wilbur, W.J., Smith T.F. & Waterman, M.S. (1984) "On the
|
|
statistical significance of nucleic acid similarities." Nucl. Acids Res.
|
|
12:215-226. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/6694902">(PubMed)</A><BR><BR>
|
|
<A NAME="ref3">[3]</A>
|
|
Altschul, S.F. & Erickson, B.W. (1985) "Significance of nucleotide sequence
|
|
alignments: a method for random sequence permutation that preserves
|
|
dinucleotide and codon usage." Mol. Biol. Evol. 2:526-538. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3870875">(PubMed)</A><BR><BR>
|
|
<A NAME="ref4">[4]</A> Deken, J. (1983) "Probabilistic behavior of longest-common-subsequence
|
|
length." In "Time Warps, String Edits and Macromolecules: The Theory and
|
|
Practice of Sequence Comparison." D. Sankoff & J.B. Kruskal (eds.),
|
|
pp. 55-91, Addison-Wesley, Reading, MA. <BR><BR>
|
|
|
|
<A NAME="ref5">[5]</A> Reich, J.G., Drabsch, H. & Daumler, A. (1984) "On the statistical
|
|
assessment of similarities in DNA sequences." Nucl. Acids Res.
|
|
12:5529-5543. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/6462914">(PubMed)</A><BR><BR>
|
|
<A NAME="ref6">[6]</A> Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990)
|
|
"Basic local alignment search tool." J. Mol. Biol. 215:403-410. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/2231712">(PubMed)</A><BR><BR>
|
|
<A NAME="ref7">[7]</A> Smith, T.F. & Waterman, M.S. (1981) "Identification of common molecular
|
|
subsequences." J. Mol. Biol. 147:195-197. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/7265238">(PubMed)</A><BR><BR>
|
|
<A NAME="ref8">[8]</A> Sellers, P.H. (1984) "Pattern recognition in genetic sequences by mismatch
|
|
density." Bull. Math. Biol. 46:501-514.<BR><BR>
|
|
|
|
<A NAME="ref9">[9]</A> Gumbel, E. J. (1958) "Statistics of extremes." Columbia University Press,
|
|
New York, NY.<BR><BR>
|
|
|
|
<A NAME="ref10">[10]</A> Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical
|
|
significance of molecular sequence features by using general scoring
|
|
schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/2315319">(PubMed)</A><BR><BR>
|
|
<A NAME="ref11">[11]</A> Dembo, A., Karlin, S. & Zeitouni, O. (1994) "Limit distribution of maximal
|
|
non-aligned two-sequence segmental score." Ann. Prob. 22:2022-2039.<BR><BR>
|
|
|
|
<A NAME="ref12">[12]</A> Pearson, W.R. & Lipman, D.J. (1988) "Improved tools for biological sequence
|
|
comparison." Proc. Natl. Acad. Sci. USA 85:2444-2448. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3162770">(PubMed)</A><BR><BR>
|
|
<A NAME="ref13">[13]</A> Pearson, W.R. (1995) "Comparison of methods for searching protein sequence
|
|
databases." Prot. Sci. 4:1145-1160. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/7549879">(PubMed)</A><BR><BR>
|
|
<A NAME="ref14">[14]</A> Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth.
|
|
Enzymol. 266:460-480. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/8743700">(PubMed)</A><BR><BR>
|
|
<A NAME="ref15">[15]</A> Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of
|
|
protein database search programs." Nucleic Acids Res. 25:3389-3402. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/9254694">(PubMed)</A><BR><BR>
|
|
<A NAME="ref16">[16]</A> Smith, T.F., Waterman, M.S. & Burks, C. (1985) "The statistical
|
|
distribution of nucleic acid similarities." Nucleic Acids Res. 13:645-656. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3871073">(PubMed)</A><BR><BR>
|
|
<A NAME="ref17">[17]</A> Collins, J.F., Coulson, A.F.W. & Lyall, A. (1988) "The significance of
|
|
protein sequence similarities." Comput. Appl. Biosci. 4:67-71. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3383005">(PubMed)</A><BR><BR>
|
|
<A NAME="ref18">[18]</A> Mott, R. (1992) "Maximum-likelihood estimation of the statistical
|
|
distribution of Smith-Waterman local sequence similarity scores." Bull.
|
|
Math. Biol. 54:59-75. <BR><BR>
|
|
|
|
<A NAME="ref19">[19]</A> Waterman, M.S. & Vingron, M. (1994) "Rapid and accurate estimates of
|
|
statistical significance for sequence database searches." Proc. Natl. Acad.
|
|
Sci. USA 91:4625-4628. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/8197109">(PubMed)</A><BR><BR>
|
|
<A NAME="ref20">[20]</A> Waterman, M.S. & Vingron, M. (1994) "Sequence comparison significance and
|
|
Poisson approximation." Stat. Sci. 9:367-381.<BR><BR>
|
|
|
|
<A NAME="ref21">[21]</A> Pearson, W.R. (1998) "Empirical statistical estimates for sequence
|
|
similarity searches." J. Mol. Biol. 276:71-84. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/9514730">(PubMed)</A><BR><BR>
|
|
<A NAME="ref22">[22]</A> Arratia, R. & Waterman, M.S. (1994) "A phase transition for the score in
|
|
matching random sequences allowing deletions." Ann. Appl. Prob. 4:200-225.<BR><BR>
|
|
|
|
<A NAME="ref23">[23]</A> McLachlan, A.D. (1971) "Tests for comparing related amino-acid sequences.
|
|
Cytochrome c and cytochrome c-551." J. Mol. Biol. 61:409-424. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/5167087">(PubMed)</A><BR><BR>
|
|
<A NAME="ref24">[24]</A> Dayhoff, M.O., Schwartz, R.M. & Orcutt, B.C. (1978) "A model of
|
|
evolutionary change in proteins." In "Atlas of Protein Sequence and
|
|
Structure," Vol. 5, Suppl. 3 (ed. M.O. Dayhoff), pp. 345-352. Natl. Biomed. Res. Found., Washington, DC.<BR><BR>
|
|
|
|
<A NAME="ref25">[25]</A> Schwartz, R.M. & Dayhoff, M.O. (1978) "Matrices for detecting distant
|
|
relationships." In "Atlas of Protein Sequence and Structure," Vol. 5,
|
|
Suppl. 3 (ed. M.O. Dayhoff), pp. 353-358. Natl. Biomed. Res. Found.,
|
|
Washington, DC.<BR><BR>
|
|
|
|
<A NAME="ref26">[26]</A> Feng, D.F., Johnson, M.S. & Doolittle, R.F. (1984) "Aligning amino acid
sequences: comparison of commonly used methods." J. Mol. Evol. 21:112-125. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/6100188">(PubMed)</A><BR><BR>
<A NAME="ref27">[27]</A> Wilbur, W.J. (1985) "On the PAM matrix model of protein evolution." Mol.
Biol. Evol. 2:434-447. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3870870">(PubMed)</A><BR><BR>
<A NAME="ref28">[28]</A> Taylor, W.R. (1986) "The classification of amino acid conservation."
J. Theor. Biol. 119:205-218. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3461222">(PubMed)</A><BR><BR>
<A NAME="ref29">[29]</A> Rao, J.K.M. (1987) "New scoring matrix for amino acid residue exchanges
based on residue characteristic physical parameters." Int. J. Peptide
Protein Res. 29:276-281.<BR><BR>
<A NAME="ref30">[30]</A> Risler, J.L., Delorme, M.O., Delacroix, H. & Henaut, A. (1988) "Amino acid
substitutions in structurally related proteins. A pattern recognition
approach. Determination of a new and efficient scoring matrix." J. Mol.
Biol. 204:1019-1029. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3221397">(PubMed)</A><BR><BR>
<A NAME="ref31">[31]</A> Altschul, S.F. (1991) "Amino acid substitution matrices from an information
theoretic perspective." J. Mol. Biol. 219:555-565. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/2051488">(PubMed)</A><BR><BR>
<A NAME="ref32">[32]</A> States, D.J., Gish, W. & Altschul, S.F. (1991) "Improved sensitivity
of nucleic acid database searches using application-specific scoring
matrices." Methods 3:66-70.<BR><BR>
<A NAME="ref33">[33]</A> Gonnet, G.H., Cohen, M.A. & Benner, S.A. (1992) "Exhaustive matching of the
entire protein sequence database." Science 256:1443-1445. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/1604319">(PubMed)</A><BR><BR>
<A NAME="ref34">[34]</A> Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from
protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/1438297">(PubMed)</A><BR><BR>
<A NAME="ref35">[35]</A> Jones, D.T., Taylor, W.R. & Thornton, J.M. (1992) "The rapid generation of
mutation data matrices from protein sequences." Comput. Appl. Biosci.
8:275-282. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/1633570">(PubMed)</A><BR><BR>
<A NAME="ref36">[36]</A> Overington, J., Donnelly, D., Johnson, M.S., Sali, A. & Blundell, T.L.
(1992) "Environment-specific amino acid substitution tables: Tertiary
templates and prediction of protein folds." Prot. Sci. 1:216-226. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/1304904">(PubMed)</A><BR><BR>
<A NAME="ref37">[37]</A> Henikoff, S. & Henikoff, J.G. (1993) "Performance evaluation of amino acid
substitution matrices." Proteins 17:49-61. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/8234244">(PubMed)</A><BR><BR>
<A NAME="ref38">[38]</A> Gotoh, O. (1982) "An improved algorithm for matching biological sequences."
J. Mol. Biol. 162:705-708. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/7166760">(PubMed)</A><BR><BR>
<A NAME="ref39">[39]</A> Fitch, W.M. & Smith, T.F. (1983) "Optimal sequence alignments." Proc. Natl.
Acad. Sci. USA 80:1382-1386.<BR><BR>
<A NAME="ref40">[40]</A> Altschul, S.F. & Erickson, B.W. (1986) "Optimal sequence alignment using
affine gap costs." Bull. Math. Biol. 48:603-616. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3580642">(PubMed)</A><BR><BR>
<A NAME="ref41">[41]</A> Myers, E.W. & Miller, W. (1988) "Optimal alignments in linear space."
Comput. Appl. Biosci. 4:11-17. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3382986">(PubMed)</A><BR><BR>
<A NAME="ref42">[42]</A> Claverie, J.-M. & States, D.J. (1993) "Information enhancement methods for
large-scale sequence-analysis." Comput. Chem. 17:191-201.<BR><BR>
<A NAME="ref43">[43]</A> Wootton, J.C. & Federhen, S. (1993) "Statistics of local complexity in
amino acid sequences and sequence databases." Comput. Chem. 17:149-163.<BR><BR>
<A NAME="ref44">[44]</A> Altschul, S.F., Boguski, M.S., Gish, W. & Wootton, J.C. (1994) "Issues in
searching molecular sequence databases." Nature Genet. 6:119-129. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/8162065">(PubMed)</A><BR><BR>
</td>
</tr>
</table>
<!-- end of content --> <!-- bottom of the page -->
</span>
</table>
</body>
</html>