272 lines
16 KiB
HTML
272 lines
16 KiB
HTML
<html>
|
|
|
|
<head>
|
|
|
|
<title>The Statistics of PSI-BLAST Scores</title>
|
|
|
|
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
|
|
|
|
<META NAME="keywords" CONTENT="sequence analysis, BLAST, Altschul, Cold Sping Harbor, statistics, sequence similarity">
|
|
|
|
<META NAME="description" CONTENT="insert the description to be displayed by the search engine. Also searched by the search engine.">
|
|
|
|
<link rel="stylesheet" href="./ncbi.css">
|
|
|
|
</head>
|
|
|
|
|
|
|
|
|
|
|
|
<body bgcolor="#FFFFFF" background="GIFS/bkgd.gif" alt="" text="#000000" link="#000099" vlink="#6666CC">
|
|
|
|
<span class="TEXT"> <!-- the header -->
|
|
|
|
<table border="0" width="600" cellspacing="0" cellpadding="0">
|
|
|
|
<tr>
|
|
|
|
<td width="140"><a href="https://www.ncbi.nlm.nih.gov"> <img src="GIFS/left.GIF" alt="NCBI" width="130" height="45" border="0"></a></td>
|
|
|
|
<td width="360" class="head1" valign="BOTTOM"> <span class="H1">The Statistics of PSI-BLAST Scores</span></td>
|
|
|
|
<td width="100" valign="MIDDLE"><A HREF="http://www.cshl.org/"><IMG SRC="GIFS/CSH.gif" ALT="CSH" ALIGN=BOTTOM WIDTH="45" HEIGHT="45" BORDER="0"></A></td>
|
|
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
<!-- the quicklinks bar -->
|
|
|
|
<table CLASS="TEXT" border="0" width="600" cellspacing="0" cellpadding="3" bgcolor="#000000">
|
|
|
|
<tr CLASS="TEXT" align="CENTER">
|
|
<td width="170"><a href="Altschul-1.html" class="BAR">The statistics of <BR>sequence similarity scores</a></td>
|
|
|
|
<td width="170"><a href="Altschul-3.html" class="BAR">The statistics of <BR>PSI-BLAST scores</a></td>
|
|
|
|
<td width="170"><a href="Altschul-2.html" class="BAR">Iterated profile searches <BR>with PSI-BLAST</a></td>
|
|
|
|
<td width="90"><a href="https://blast.ncbi.nlm.nih.gov/" class="BAR">BLAST<BR>Home</a></td>
|
|
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
<!-- the contents -->
|
|
|
|
<table border="0" width="600" cellspacing="0" cellpadding="0">
|
|
|
|
<tr valign="TOP"> <!-- left column -->
|
|
|
|
<td width="125">
|
|
|
|
<p> </p>
|
|
<span class="GUTTER1"><a href="#head1" class="GUTTER">Estimation of statistical parameters for local alignment scores</a><BR><BR>
|
|
|
|
<span class="GUTTER1"><a href="#head2" class="GUTTER">Generalization to PSI-BLAST alignment scores</a><BR><BR>
|
|
|
|
<span class="GUTTER1"><a href="#refs" class="GUTTER">References</a><BR><BR>
|
|
|
|
|
|
</td>
|
|
|
|
<!-- extra column to force things over the gif border -->
|
|
|
|
<td width="15"> </td>
|
|
|
|
<!-- right content column -->
|
|
|
|
<td class="TEXTWIDE" width="460">
|
|
|
|
<p> </p>
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head1">Estimation of statistical parameters for local alignment scores</A></h3>
|
|
|
|
As discussed previously, computation experiments <A HREF="#ref1">[1-8]</A> strongly suggest that
|
|
the optimal gapped local alignment scores produced by the Smith-Waterman
|
|
algorithm <A HREF="#ref9">[9]</A> and approximated by FASTA <A HREF="#ref10">[10]</A> or Gapped BLAST <A HREF="#ref6">[6,7]</A> follow
|
|
an extreme value distribution <A HREF="#ref11">[11]</A>. Specifically, the probability that the
|
|
optimal score <I>S</I> from the comparison of unrelated proteins is at least <I>x</I> is
|
|
given by the equation<BR>
|
|
|
|
<IMG SRC="GIFS/3.(1).gif" alt="Formula 1" WIDTH="460" HEIGHT="50" BORDER="0"><BR><BR><BR><BR>
|
|
|
|
where <I>K</I> and <I>lambda</I> are statistical parameters dependent upon the scoring
|
|
system and the background amino acid frequences of the sequences being
|
|
compared. While FASTA estimates these parameters from the scores generated
|
|
by actual database searches, BLAST estimates them beforehand for specific
|
|
scoring schemes by comparing many random sequences generated using a standard
|
|
protein amino acid composition <A HREF="#ref12">[12]</A>.<BR>
|
|
|
|
For example, using BLOSUM-62 amino acid substitution scores <A HREF="#ref13">[13]</A>, and affine
|
|
gap costs <A HREF="#ref14">[14-16]</A> in which a gap of length <I>k</I> is assigned a score of -(10 + <I>k</I>),
|
|
we generated 10,000 pairs of length-1000 random protein sequences, and used
|
|
the Smith-Waterman algorithm to calculate 10,000 optimal local alignment
|
|
scores. From these scores, <I>lambda</I> was estimated at 0.252 and <I>K</I> at 0.035
|
|
by the method of maximum-likelihood <A HREF="#ref17">[17]</A>. In general, given <I>M</I> samples from
|
|
an extreme value distribution, the ratio of the maximum-likelihood estimate
|
|
of <I>lambda</I> to its actual value is approximately normally distributed, with mean
|
|
1.0 and standard deviation 0.78/sqrt(M) <A HREF="#ref17">[17]</A>. Thus the standard error for our
|
|
estimate of <I>lambda</I> is about 0.002, or less than 1%.<BR>
|
|
|
|
A plot of the optimal local alignment scores generated in this example is
|
|
shown in <A HREF="Fig3.1.html" TARGET="NEW">Figure 1</A>, along with the maximum-likelihood extreme value curve.
|
|
The chi-squared goodness-of-fit test for these data, with 34 degrees of
|
|
freedom, is 25.6, which is lower than would be expected to occur by chance
|
|
87% of the time even were the theory precisely valid. To better illustrate
|
|
the theory's accuracy, <A HREF="Fig3.2.html" TARGET="NEW">Figure 2</A> shows <I>P</I>-values derived from formula (1), or
|
|
by integrating the curve in <A HREF="Fig3.1.html" TARGET="NEW">Figure 1</A>. The solid curve represents the data,
|
|
and the dotted curve the theory. It is difficult to see any space between
|
|
them.<BR>
|
|
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name="head2">Generalization to PSI-BLAST alignment scores</A></h3>
|
|
|
|
In order for PSI-BLAST to iterate automatically, it needs to be able to
|
|
generate accurate estimates of the statistical significance of the alignments
|
|
it produces. Unfortunately, there is no analytic theory with which to estimate
|
|
the statistical significance of a gapped local alignment of a profile and a
|
|
simple sequence. We may conjecture, however, that the empirical statistics
|
|
for optimal local alignments of two sequences may also apply to alignments
|
|
involving a profile and a sequence.<BR>
|
|
|
|
Even were this the case, we are faced with the problem of estimating the
|
|
statistical parameters <I>lambda</I> and <I>K</I> for profiles we have seen for the first
|
|
time. As before, this may be done by random simulation, but sufficiently
|
|
precise estimates require on the order of half an hour on a typical current
|
|
workstation. This is at least two orders of magnitude too long for use with
|
|
PSI-BLAST.<BR>
|
|
|
|
One hope is that if we construct the amino acid scores within each column
|
|
of a PSI-BLAST profile to the same "scale" <A HREF="#ref18">[18,19]</A> (i.e. with the same
|
|
ungapped <I>lambda</I>) as those for a standard amino acid substitution matrix,
|
|
and then use the same position-independent gap costs, the same gapped <I>lambda</I>
|
|
may result.<BR>
|
|
|
|
To review, for ungapped local alignments, any substitution matrix takes the
|
|
form<BR>
|
|
|
|
<IMG SRC="GIFS/3.(2).gif" alt="Formula 2" WIDTH="460" HEIGHT="65" BORDER="0"><BR><BR><BR><BR>
|
|
|
|
where the <I>q<SUB>ij</SUB></I> are the target frequencies for aligned pairs of amino acids,
|
|
the <I>p<SUB>i</SUB></I> are background frequencies, and the subscript for <I>lambda</I> indicates
|
|
it is the statistical parameter for ungapped local alignments <A HREF="#ref18">[18,19]</A>. For a
|
|
PSI-BLAST profile <A HREF="#ref7">[7]</A>, each column has its own unique set of amino acid target
|
|
frequencies <I>q<SUB>i</SUB></I>. Following (2), the amino acid scores for this column may
|
|
be constructed to the same "scale" by using the formula<BR>
|
|
|
|
<IMG SRC="GIFS/3.(3).gif" alt="Formula 3" WIDTH="460" HEIGHT="65" BORDER="0"><BR><BR><BR><BR>
|
|
|
|
The hope is that, given a specific set of gap costs, the gapped <I>lambda</I> for
|
|
the PSI-BLAST profile will be the same as the gapped <I>lambda</I> for the standard
|
|
substitution matrix, which may be calculated in advance.<BR>
|
|
|
|
To test this hypothesis, we searched SWISS-PROT <A HREF="#ref20">[20]</A>, release 34 (59,576
|
|
sequences; 21,219,450 residues) with the length-567 influenza A virus
|
|
hemagglutinin precursor <A HREF="#ref21">[21]</A> (SWISS-PROT accession number P03435) as query,
|
|
and captured the profile constructed by PSI-BLAST from the 128 local alignments
|
|
with <I>E</I>-value <= 0.01. This profile was constructed to have almost the same
|
|
ungapped lambda (i.e. 0.3176) as the BLOSUM-62 matrix. We then compared this
|
|
profile to 10,000 random sequences of length 567, generated using standard
|
|
background amino acid frequencies <A HREF="#ref12">[12]</A>. As before, a gap of length <I>k</I> was
|
|
assigned a score of -(10 + <I>k</I>). Counts of the optimal local alignment scores,
|
|
calculated using an appropriately modified version of the Smith-Waterman
|
|
algorithm, are plotted in <A HREF="Fig3.3.html" TARGET="NEW">Figure 3</A>. Also shown is the corresponding maximum-
|
|
likelihood extreme value distribution, which has statistical parameters
|
|
<I>lambda</I> = 0.254 and <I>K</I> = 0.040. It is apparent that the distribution fits the
|
|
random trial reasonably well; a chi-squared goodness-of-fit test with 34
|
|
degrees of freedom has value 39.7, which is lower than one would expect 27%
|
|
of the time even were the theory precisely valid.<BR>
|
|
|
|
This supports the idea that our statistical theory applies to PSI-BLAST
|
|
alignments. Furthermore, the estimate for gapped <I>lambda</I> of 0.254 +- 0.002
|
|
agrees to within experimental error with the value 0.252 +- 0.002 previously
|
|
estimated for BLOSUM-62, using the same gap costs (see <A HREF="table1.html" TARGET="NEW">Table 1</A>). Similar
|
|
agreement was obtained with a number of other protein sequences as initial
|
|
query (results not shown), and in all cases the much less important gapped <I>K</I>
|
|
parameter could be estimated accurately as well (<A HREF="table1.html" TARGET="NEW">Table 1</A>). In general, values
|
|
of <I>lambda</I> for local profile alignments appear to differ by less than 2% from
|
|
the values for simple local sequence comparison. Using these precomputed
|
|
values for gapped <I>lambda</I> should thus entail an error of less one bit for
|
|
PSI-BLAST scores under 50 bits, corresponding to a factor of less than two
|
|
in the estimation of statistical significance.<BR>
|
|
|
|
The accuracy of PSI-BLAST statistics was further tested by the
|
|
comparison of PSI-BLAST profiles with a shuffled version of SWISS-PROT.
|
|
The results from eleven queries are shown in <A HREF="table2.html" TARGET="NEW">Table 2</A>. For each, the query
|
|
was compared to the SWISS-PROT database, and a PSI-BLAST profile was
|
|
constructed from the resulting significant alignments. This profile was
|
|
then compared to a version of SWISS-PROT in which each sequence was shuffled.
|
|
The <I>E</I>-values predicted by the estimated PSI-BLAST parameters are pretty
|
|
close to the mark.<BR>
|
|
|
|
The accuracy of PSI-BLAST statistics might lead us to speculate that
|
|
the lambdas for gapped alignments are a function only of the scale of the
|
|
substitution matrix and the gap costs. Unfortunately, this is not the case,
|
|
as an examination of <A HREF="table3.html" TARGET="NEW">Table 3</A> demonstrates. It appears that the effect of
|
|
specific gap costs on <I>lambda</I> is a function at least of the relative entropy
|
|
of the substitution matrix, as well as its scale. The reason that PSI-BLAST
|
|
profile <I>lambdas</I> could be predicted so well by BLOSUM-62 is that these profiles
|
|
are constructed based upon BLOSUM-62 target frequencies, and therefore have
|
|
relative entropies roughly equivalent to that of BLOSUM-62. A more precise
|
|
theory of gapped alignment statistics, however, may be in the offing <A HREF="#ref22">[22]</A>.<BR>
|
|
|
|
<h3><img src="GIFS/bluebullet.gif" alt="" width="16" height="14"><A name = "refs">References</A></h3>
|
|
|
|
|
|
<A NAME="ref1">[1]</A> Smith, T.F., Waterman, M.S. & Burks, C. (1985) "The statistical
|
|
distribution of nucleic acid similarities." Nucleic Acids Res. 13:645-656. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3871073">(PubMed)</A><BR><BR><A NAME="ref2">[2]</A> Collins, J.F., Coulson, A.F.W. & Lyall, A. (1988) "The significance of
|
|
protein sequence similarities." Comput. Appl. Biosci. 4:67-71. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3383005">(PubMed)</A><BR><BR><A NAME="ref3">[3]</A> Mott, R. (1992) "Maximum-likelihood estimation of the statistical
|
|
distribution of Smith-Waterman local sequence similarity scores." Bull.
|
|
Math. Biol. 54:59-75. <BR><BR>
|
|
<A NAME="ref4">[4]</A> Waterman, M.S. & Vingron, M. (1994) "Rapid and accurate estimates of
|
|
statistical significance for sequence database searches." Proc. Natl. Acad. Sci. USA 91:4625-4628. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/8197109">(PubMed)</A><BR><BR><A NAME="ref5">[5]</A> Waterman, M.S. & Vingron, M. (1994) "Sequence comparison significance and
|
|
Poisson approximation." Stat. Sci. 9:367-381.<BR><BR>
|
|
<A NAME="ref6">[6]</A> Altschul, S.F. & Gish, W. (1996) "Local alignment statistics." Meth.
|
|
Enzymol. 266:460-480. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/8743700">(PubMed)</A><BR><BR><A NAME="ref7">[7]</A> Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/9254694">(PubMed)</A><BR><BR><A NAME="ref8">[8]</A> Pearson, W.R. (1998) "Empirical statistical estimates for sequence
|
|
similarity searches." J. Mol. Biol. 276:71-84. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/9514730">(PubMed)</A><BR><BR><A NAME="ref9">[9]</A> Smith, T.F. & Waterman, M.S. (1981) "Identification of common molecular
|
|
subsequences." J. Mol. Biol. 147:195-197. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/7265238">(PubMed)</A><BR><BR><A NAME="ref10">[10]</A> Pearson, W.R. & Lipman, D.J. (1988) "Improved tools for biological sequence
|
|
comparison." Proc. Natl. Acad. Sci. USA 85:2444-2448. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3162770">(PubMed)</A><BR><BR><A NAME="ref11">[11]</A> Gumbel, E.J. (1958) "Statistics of extremes." Columbia University Press,
|
|
New York, NY.<BR><BR>
|
|
<A NAME="ref12">[12]</A> Robinson, A.B. & Robinson, L.R. (1991) "Distribution of glutamine and
|
|
asparagine residues and their near neighbors in peptides and proteins."
|
|
Proc. Natl. Acad. Sci. USA 88:8880-8884. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/1924347">(PubMed)</A><BR><BR><A NAME="ref13">[13]</A> Henikoff, S. & Henikoff, J.G. (1992) "Amino acid substitution matrices from
|
|
protein blocks." Proc. Natl. Acad. Sci. USA 89:10915-10919. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/1438297">(PubMed)</A><BR><BR><A NAME="ref14">[14]</A> Fitch, W.M. & Smith, T.F. (1983) "Optimal sequence alignments." Proc. Natl. Acad. Sci. USA 80:1382-1386.<BR><BR>
|
|
<A NAME="ref15">[15]</A> Altschul, S.F. & Erickson, B.W. (1986) "Optimal sequence alignment using
|
|
affine gap costs." Bull. Math. Biol. 48:603-616. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3580642">(PubMed)</A><BR><BR><A NAME="ref16">[16]</A> Myers, E.W. & Miller, W. (1988) "Optimal alignments in linear space."
|
|
Comput. Appl. Biosci. 4:11-17. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/3382986">(PubMed)</A><BR><BR><A NAME="ref17">[17]</A> Lawless, J.F. (1982) "Statistical models and methods for lifetime data."
|
|
Chapter 4. John Wiley & Sons, New York, NY.<BR><BR>
|
|
<A NAME="ref18">[18]</A> Karlin, S. & Altschul, S.F. (1990) "Methods for assessing the statistical
|
|
significance of molecular sequence features by using general scoring
|
|
schemes." Proc. Natl. Acad. Sci. USA 87:2264-2268. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/2315319">(PubMed)</A><BR><BR><A NAME="ref19">[19]</A> Altschul, S.F. (1991) "Amino acid substitution matrices from an information theoretic perspective." J. Mol. Biol. 219:555-565. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/2051488">(PubMed)</A><BR><BR><A NAME="ref20">[20]</A> Bairoch, A. & Apweiler, R. (1998) "The SWISS-PROT protein sequence data
|
|
bank and its supplement TrEMBL in 1998." Nucleic Acids Res. 26:38-42. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/9016499">(PubMed)</A><BR><BR><A NAME="ref21">21]</A> Jou, W.M., Verhoeyen, M., Devos, R., Saman, E., Fang, R., Huylebroeck, D.,
|
|
Fiers, W., Threlfall, G., Barber, C., Carey, N. & Emtage, S. (1980)
|
|
"Complete structure of the hemagglutinin gene from the human influenza
|
|
A/Victoria/3/75 (H3N2) strain as determined from cloned DNA." Cell
|
|
19:683-696. <A HREF="https://www.ncbi.nlm.nih.gov/pubmed/6153930">(PubMed)</A><BR><BR><A NAME="ref22">[22]</A> Mott, R. & Tribe, R. "Approximate statistics for gapped alignments."
|
|
J. Comp. Biol., in press.<BR><BR>
|
|
|
|
|
|
</td>
|
|
|
|
|
|
|
|
</tr>
|
|
|
|
</table>
|
|
|
|
<!-- end of content --> <!-- bottom of the page -->
|
|
|
|
</span>
|
|
|
|
</table>
|
|
|
|
|
|
|
|
|
|
|
|
</body>
|
|
|
|
</html>
|
|
|