nih-gov/www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-3.html

272 lines
16 KiB
HTML

<html>
<head>
<title>The Statistics of PSI-BLAST Scores</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<META NAME="keywords" CONTENT="sequence analysis, BLAST, Altschul, Cold Sping Harbor, statistics, sequence similarity">
<META NAME="description" CONTENT="insert the description to be displayed by the search engine. Also searched by the search engine.">
<link rel="stylesheet" href="./ncbi.css">
</head>
<body bgcolor="#FFFFFF" background="GIFS/bkgd.gif" alt="" text="#000000" link="#000099" vlink="#6666CC">
<span class="TEXT"> <!-- the header -->
<table border="0" width="600" cellspacing="0" cellpadding="0">
<tr>
<td width="140"><a href="https://www.ncbi.nlm.nih.gov"> <img src="GIFS/left.GIF" alt="NCBI" width="130" height="45" border="0"></a></td>
<td width="360" class="head1" valign="BOTTOM"> <span class="H1">The Statistics of PSI-BLAST Scores</span></td>
<td width="100" valign="MIDDLE"><A HREF="http://www.cshl.org/"><IMG SRC="GIFS/CSH.gif" ALT="CSH" ALIGN=BOTTOM WIDTH="45" HEIGHT="45" BORDER="0"></A></td>
</tr>
</table>
<!-- the quicklinks bar -->
<table CLASS="TEXT" border="0" width="600" cellspacing="0" cellpadding="3" bgcolor="#000000">
<tr CLASS="TEXT" align="CENTER">
<td width="170"><a href="Altschul-1.html" class="BAR">The statistics of <BR>sequence similarity scores</a></td>
<td width="170"><a href="Altschul-3.html" class="BAR">The statistics of <BR>PSI-BLAST scores</a></td>
<td width="170"><a href="Altschul-2.html" class="BAR">Iterated profile searches <BR>with PSI-BLAST</a></td>
<td width="90"><a href="https://blast.ncbi.nlm.nih.gov/" class="BAR">BLAST<BR>Home</a></td>
</tr>
</table>
<!-- the contents -->
<table border="0" width="600" cellspacing="0" cellpadding="0">
<tr valign="TOP"> <!-- left column -->
<td width="125">
<p>&nbsp;</p>
<span class="GUTTER1"><a href="#head1" class="GUTTER">Estimation of statistical parameters for local alignment scores</a><BR><BR>
<span class="GUTTER1"><a href="#head2" class="GUTTER">Generalization to PSI-BLAST alignment scores</a><BR><BR>
<span class="GUTTER1"><a href="#refs" class="GUTTER">References</a><BR><BR>
</td>
<!-- extra column to force things over the gif border -->
<td width="15"> </td>
<!-- right content column -->
<td class="TEXTWIDE" width="460">
<p>&nbsp;</p>
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head1">Estimation of statistical parameters for local alignment scores</A></h3>
&nbsp;&nbsp;&nbsp;As discussed previously, computation experiments <A HREF="#ref1">[1-8]</A> strongly suggest that
the optimal gapped local alignment scores produced by the Smith-Waterman
algorithm <A HREF="#ref9">[9]</A> and approximated by FASTA <A HREF="#ref10">[10]</A> or Gapped BLAST <A HREF="#ref6">[6,7]</A> follow
an extreme value distribution <A HREF="#ref11">[11]</A>. Specifically, the probability that the
optimal score <I>S</I> from the comparison of unrelated proteins is at least <I>x</I> is
given by the equation<BR>
<IMG SRC="GIFS/3.(1).gif" alt="Formula 1" WIDTH="460" HEIGHT="50" BORDER="0"><BR><BR><BR><BR>
where <I>K</I> and <I>lambda</I> are statistical parameters dependent upon the scoring
system and the background amino acid frequences of the sequences being
compared. While FASTA estimates these parameters from the scores generated
by actual database searches, BLAST estimates them beforehand for specific
scoring schemes by comparing many random sequences generated using a standard
protein amino acid composition <A HREF="#ref12">[12]</A>.<BR>
&nbsp;&nbsp;&nbsp;For example, using BLOSUM-62 amino acid substitution scores <A HREF="#ref13">[13]</A>, and affine
gap costs <A HREF="#ref14">[14-16]</A> in which a gap of length <I>k</I> is assigned a score of -(10 + <I>k</I>),
we generated 10,000 pairs of length-1000 random protein sequences, and used
the Smith-Waterman algorithm to calculate 10,000 optimal local alignment
scores. From these scores, <I>lambda</I> was estimated at 0.252 and <I>K</I> at 0.035
by the method of maximum-likelihood <A HREF="#ref17">[17]</A>. In general, given <I>M</I> samples from
an extreme value distribution, the ratio of the maximum-likelihood estimate
of <I>lambda</I> to its actual value is approximately normally distributed, with mean
1.0 and standard deviation 0.78/sqrt(M) <A HREF="#ref17">[17]</A>. Thus the standard error for our
estimate of <I>lambda</I> is about 0.002, or less than 1%.<BR>
&nbsp;&nbsp;&nbsp;A plot of the optimal local alignment scores generated in this example is
shown in <A HREF="Fig3.1.html" TARGET="NEW">Figure 1</A>, along with the maximum-likelihood extreme value curve.
The chi-squared goodness-of-fit test for these data, with 34 degrees of
freedom, is 25.6, which is lower than would be expected to occur by chance
87% of the time even were the theory precisely valid. To better illustrate
the theory's accuracy, <A HREF="Fig3.2.html" TARGET="NEW">Figure 2</A> shows <I>P</I>-values derived from formula (1), or
by integrating the curve in <A HREF="Fig3.1.html" TARGET="NEW">Figure 1</A>. The solid curve represents the data,
and the dotted curve the theory. It is difficult to see any space between
them.<BR>
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head2">Generalization to PSI-BLAST alignment scores</A></h3>
&nbsp;&nbsp;&nbsp;In order for PSI-BLAST to iterate automatically, it needs to be able to
generate accurate estimates of the statistical significance of the alignments
it produces. Unfortunately, there is no analytic theory with which to estimate
the statistical significance of a gapped local alignment of a profile and a
simple sequence. We may conjecture, however, that the empirical statistics
for optimal local alignments of two sequences may also apply to alignments
involving a profile and a sequence.<BR>
&nbsp;&nbsp;&nbsp;Even were this the case, we are faced with the problem of estimating the
statistical parameters <I>lambda</I> and <I>K</I> for profiles we have seen for the first
time. As before, this may be done by random simulation, but sufficiently
precise estimates require on the order of half an hour on a typical current
workstation. This is at least two orders of magnitude too long for use with
PSI-BLAST.<BR>
&nbsp;&nbsp;&nbsp;One hope is that if we construct the amino acid scores within each column
of a PSI-BLAST profile to the same "scale" <A HREF="#ref18">[18,19]</A> (i.e. with the same
ungapped <I>lambda</I>) as those for a standard amino acid substitution matrix,
and then use the same position-independent gap costs, the same gapped <I>lambda</I>
may result.<BR>
&nbsp;&nbsp;&nbsp;To review, for ungapped local alignments, any substitution matrix takes the
form<BR>
<IMG SRC="GIFS/3.(2).gif" alt="Formula 2" WIDTH="460" HEIGHT="65" BORDER="0"><BR><BR><BR><BR>
where the <I>q<SUB>ij</SUB></I> are the target frequencies for aligned pairs of amino acids,
the <I>p<SUB>i</SUB></I> are background frequencies, and the subscript for <I>lambda</I> indicates
it is the statistical parameter for ungapped local alignments <A HREF="#ref18">[18,19]</A>. For a
PSI-BLAST profile <A HREF="#ref7">[7]</A>, each column has its own unique set of amino acid target
frequencies <I>q<SUB>i</SUB></I>. Following (2), the amino acid scores for this column may
be constructed to the same "scale" by using the formula<BR>
<IMG SRC="GIFS/3.(3).gif" alt="Formula 3" WIDTH="460" HEIGHT="65" BORDER="0"><BR><BR><BR><BR>
The hope is that, given a specific set of gap costs, the gapped <I>lambda</I> for
the PSI-BLAST profile will be the same as the gapped <I>lambda</I> for the standard
substitution matrix, which may be calculated in advance.<BR>
&nbsp;&nbsp;&nbsp;To test this hypothesis, we searched SWISS-PROT <A HREF="#ref20">[20]</A>, release 34 (59,576
sequences; 21,219,450 residues) with the length-567 influenza A virus
hemagglutinin precursor <A HREF="#ref21">[21]</A> (SWISS-PROT accession number P03435) as query,
and captured the profile constructed by PSI-BLAST from the 128 local alignments
with <I>E</I>-value <= 0.01. This profile was constructed to have almost the same
ungapped lambda (i.e. 0.3176) as the BLOSUM-62 matrix. We then compared this
profile to 10,000 random sequences of length 567, generated using standard
background amino acid frequencies <A HREF="#ref12">[12]</A>. As before, a gap of length <I>k</I> was
assigned a score of -(10 + <I>k</I>). Counts of the optimal local alignment scores,
calculated using an appropriately modified version of the Smith-Waterman
algorithm, are plotted in <A HREF="Fig3.3.html" TARGET="NEW">Figure 3</A>. Also shown is the corresponding maximum-
likelihood extreme value distribution, which has statistical parameters
<I>lambda</I> = 0.254 and <I>K</I> = 0.040. It is apparent that the distribution fits the
random trial reasonably well; a chi-squared goodness-of-fit test with 34
degrees of freedom has value 39.7, which is lower than one would expect 27%
of the time even were the theory precisely valid.<BR>
&nbsp;&nbsp;&nbsp;This supports the idea that our statistical theory applies to PSI-BLAST
alignments. Furthermore, the estimate for gapped <I>lambda</I> of 0.254 +- 0.002
agrees to within experimental error with the value 0.252 +- 0.002 previously
estimated for BLOSUM-62, using the same gap costs (see <A HREF="table1.html" TARGET="NEW">Table 1</A>). Similar
agreement was obtained with a number of other protein sequences as initial
query (results not shown), and in all cases the much less important gapped <I>K</I>
parameter could be estimated accurately as well (<A HREF="table1.html" TARGET="NEW">Table 1</A>). In general, values
of <I>lambda</I> for local profile alignments appear to differ by less than 2% from
the values for simple local sequence comparison. Using these precomputed
values for gapped <I>lambda</I> should thus entail an error of less one bit for
PSI-BLAST scores under 50 bits, corresponding to a factor of less than two
in the estimation of statistical significance.<BR>
&nbsp;&nbsp;&nbsp;The accuracy of PSI-BLAST statistics was further tested by the
comparison of PSI-BLAST profiles with a shuffled version of SWISS-PROT.
The results from eleven queries are shown in <A HREF="table2.html" TARGET="NEW">Table 2</A>. For each, the query
was compared to the SWISS-PROT database, and a PSI-BLAST profile was
constructed from the resulting significant alignments. This profile was
then compared to a version of SWISS-PROT in which each sequence was shuffled.
The <I>E</I>-values predicted by the estimated PSI-BLAST parameters are pretty
close to the mark.<BR>
&nbsp;&nbsp;&nbsp;The accuracy of PSI-BLAST statistics might lead us to speculate that
the lambdas for gapped alignments are a function only of the scale of the
substitution matrix and the gap costs. Unfortunately, this is not the case,
as an examination of <A HREF="table3.html" TARGET="NEW">Table 3</A> demonstrates. It appears that the effect of
specific gap costs on <I>lambda</I> is a function at least of the relative entropy
of the substitution matrix, as well as its scale. The reason that PSI-BLAST
profile <I>lambdas</I> could be predicted so well by BLOSUM-62 is that these profiles
are constructed based upon BLOSUM-62 target frequencies, and therefore have
relative entropies roughly equivalent to that of BLOSUM-62. A more precise
theory of gapped alignment statistics, however, may be in the offing <A HREF="#ref22">[22]</A>.<BR>
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name = "refs">References</A></h3>
<A NAME="ref1">[1]</A> Smith, T.F., Waterman, M.S. & Burks, C. (1985) "The statistical
distribution of nucleic acid similarities." Nucleic Acids Res. 13:645-656. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3871073">(PubMed)</A><BR><BR><A NAME="ref2">[2]</A> Collins, J.F., Coulson, A.F.W. & Lyall, A. (1988) "The significance of
protein sequence similarities." Comput. Appl. Biosci. 4:67-71. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3383005">(PubMed)</A><BR><BR><A NAME="ref3">[3]</A> Mott, R. (1992) "Maximum-likelihood estimation of the statistical
distribution of Smith-Waterman local sequence similarity scores." Bull.
Math. Biol. 54:59-75. <BR><BR>
<A NAME="ref4">[4]</A> Waterman, M.S. & Vingron, M. (1994) "Rapid and accurate estimates of
statistical significance for sequence database searches." Proc. Natl. Acad. Sci. USA 91:4625-4628. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/8197109">(PubMed)</A><BR><BR><A NAME="ref5">[5]</A> Waterman, M.S. & Vingron, M. (1994) "Sequence comparison significance and
Poisson approximation." Stat. Sci. 9:367-381.<BR><BR>
<A NAME="ref6">[6]</A> Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth.
Enzymol. 266:460-480. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/8743700">(PubMed)</A><BR><BR><A NAME="ref7">[7]</A> Altschul, S.F., Madden, T.L., Sch&auml;ffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/9254694">(PubMed)</A><BR><BR><A NAME="ref8">[8]</A> Pearson, W.R. (1998) "Empirical statistical estimates for sequence
similarity searches." J. Mol. Biol. 276:71-84. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/9514730">(PubMed)</A><BR><BR><A NAME="ref9">[9]</A> Smith, T.F. & Waterman, M.S. (1981) "Identification of common molecular
subsequences." J. Mol. Biol. 147:195-197. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/7265238">(PubMed)</A><BR><BR><A NAME="ref10">[10]</A> Pearson, W.R. & Lipman, D.J. (1988) "Improved tools for biological sequence
comparison." Proc. Natl. Acad. Sci. USA 85:2444-2448. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3162770">(PubMed)</A><BR><BR><A NAME="ref11">[11]</A> Gumbel, E.J. (1958) "Statistics of extremes." Columbia University Press,
New York, NY.<BR><BR>
<A NAME="ref12">[12]</A> Robinson, A.B. & Robinson, L.R. (1991) "Distribution of glutamine and
asparagine residues and their near neighbors in peptides and proteins."
Proc. Natl. Acad. Sci. USA 88:8880-8884. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/1924347">(PubMed)</A><BR><BR><A NAME="ref13">[13]</A> Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from
protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/1438297">(PubMed)</A><BR><BR><A NAME="ref14">[14]</A> Fitch, W.M. & Smith, T.F. (1983) "Optimal sequence alignments." Proc. Natl. Acad. Sci. USA 80:1382-1386.<BR><BR>
<A NAME="ref15">[15]</A> Altschul, S.F. & Erickson, B.W. (1986) "Optimal sequence alignment using
affine gap costs." Bull. Math. Biol. 48:603-616. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3580642">(PubMed)</A><BR><BR><A NAME="ref16">[16]</A> Myers, E.W. & Miller, W. (1988) "Optimal alignments in linear space."
Comput. Appl. Biosci. 4:11-17. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3382986">(PubMed)</A><BR><BR><A NAME="ref17">[17]</A> Lawless, J.F. (1982) "Statistical models and methods for lifetime data."
Chapter 4. John Wiley & Sons, New York, NY.<BR><BR>
<A NAME="ref18">[18]</A> Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical
significance of molecular sequence features by using general scoring
schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/2315319">(PubMed)</A><BR><BR><A NAME="ref19">[19]</A> Altschul, S.F. (1991) "Amino acid substitution matrices from an information theoretic perspective." J. Mol. Biol. 219:555-565. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/2051488">(PubMed)</A><BR><BR><A NAME="ref20">[20]</A> Bairoch, A. & Apweiler, R. (1998) "The SWISS-PROT protein sequence data
bank and its supplement TrEMBL in 1998." Nucleic Acids Res. 26:38-42. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/9016499">(PubMed)</A><BR><BR><A NAME="ref21">21]</A> Jou, W.M., Verhoeyen, M., Devos, R., Saman, E., Fang, R., Huylebroeck, D.,
Fiers, W., Threlfall, G., Barber, C., Carey, N. & Emtage, S. (1980)
"Complete structure of the hemagglutinin gene from the human influenza
A/Victoria/3/75 (H3N2) strain as determined from cloned DNA." Cell
19:683-696. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/6153930">(PubMed)</A><BR><BR><A NAME="ref22">[22]</A> Mott, R. & Tribe, R. "Approximate statistics for gapped alignments."
J. Comp. Biol., in press.<BR><BR>
</td>
</tr>
</table>
<!-- end of content --> <!-- bottom of the page -->
</span>
</table>
</body>
</html>